personal computers.

The bilingual operating system compatibility as well as the
Arabic characters' code values are investigated. The Latin
code is fed into a computer to be compiled and run with a
Latin compiler (i.e., Turbo PASCAL), in an Arabic
I.  INTRODUCTION


The English language is the most popular scientific language
used today. Its alphabet descended from the Latin alphabet,
and the language has had wide use in the scientific field.
The English alphabet is familiar to people in Europe and in
all countries that use alphabets descended from Latin. There
are only slight differences between the various alphabets
that have descended from Latin.

The wide use of Latin alphabets has made it easy to set
standards for typewriters and console keyboards. The
similarities among these languages in grammar, fonts, and
direction of flow (i.e., left to right) have made
standardization easy.

Keep in mind that many of the computer pioneers made an
effort not to limit the implementation of their software to
one spoken language. Software is the limiting factor in the
use of computers in any language. Typically, programmers'
lack of knowledge of a foreign language limits their ability
to write applications acceptable to the user. Few nations
are blessed with computer development technology. However,
all nations have people who, as users, are capable of
contributing to humanity using this technology.


Given the technology existing today, if we can create an
interface between a host foreign language and a target
application language, there will be fewer barriers for
nations that do not use standard English-, French-, or
German-based computer operating systems and software. The
interface will accept user commands from the host
environment and translate them to the syntax of the target
environment. It is assumed that the user is knowledgeable
in the semantics of the target environment in terms of his
spoken language.

The question may be asked, "What good will this approach do
such a nation?" There are several good points. Two of the
most important are these: one, a good library of software
already exists; and two, the price of existing software
(even with the addition of an interface communicator) is
less than that of newly written customized software. It is
faster and easier to write an interface than to rewrite a
large body of software.

Two user environments should not be confused. The first is
the customized foreign alphabets used in many countries on
mainframes for specific applications. These are developed by
contractors who are expert in that application but not
necessarily in the foreign language. The mainframes must use
the software provided by the original contractors. It takes
a lot of effort and capital to develop a new software
application for the special machine. This limits the use of
the computer to operators and data entry personnel, with
minimal creative programming from the user side. Users do
not benefit from the expertise of others or from
continuously improving software. This is because there are
few users and minimal feedback to software developers.

The second user environment is that of the average user who
has some scientific background but has neither access to
mainframe hardware nor the capital to invest in it. This
user is often an educator, a student, or a professional.
This category of users has great potential. Software with a
native language interface would be both very helpful and
affordable to this group. This group is very capable of
contributing in their respective fields with the powerful
processing features available with personal computer
technology today.

This thesis is concerned with the second user environment
for several reasons. The second group of users are the
creative ones. Their understanding of computers and their
applications is a major step toward building the target
machine with compatible native standards. This will
eliminate the ad hoc design by the contractor, who most of
the time has to hire a non-technical translator and dictate
to them the language specification, key words, and commands
of the operating system or query language. Usually a
translator will translate the machine's native language key
words to the target language using its alphabet. The
translator may have minimal programming or computer
experience. This will most likely lead to an ambiguous
environment for users to work with.

The feasibility of such an approach is constrained by
several factors. The language or the user environment is
one factor: how is the language implemented or emulated on
standard Latin language hardware? The compatibility of the
target machine (i.e., micro to mini computers) with others
in the same family is also a factor. These factors affect
technical feasibility. Economic feasibility is based on
supply and demand, and a developer must weigh the benefit
against the development cost before developing such
interface software.

The Arabic language is very rich in vocabulary and
historical background. The Arabic alphabet is very old. The
language was used for several centuries by leading ancient
mathematicians, physicians, biologists, and chemists. They
successfully contributed to their fields using the Arabic
alphabet. Their numerals, symbols, and equations were all
written in Arabic. However, this does not make it simple to
use the Arabic alphabet in the modern computer environment.

One reason is that the direction of flow in reading and
writing is from right to left. Secondly, Arabic characters
are not printed like Latin characters. Arabic words are
printed like calligraphy. Arabic characters must be written
in either stand-alone or connected form. A character may be
located in one of three positions: at the beginning of a
word, in the middle, or at the end of a word. Under a set of
complicated rules, the shape of a character is determined by
its location within the word. This difficulty has
complicated attempts to provide a software emulation of the
Arabic environment on personal computers.

The goal of this thesis is to provide an approach to solving
this problem. The steps that must be followed will be
described, along with special considerations. To show that
translation is possible, we will develop an interface to
communicate between an Arabic form of source code in the
PASCAL language and an existing English PASCAL compiler.
The interface will take sample source code written in Arabic
and lexically translate it to English source code. The goal
is that, given correct Arabic source code, the interface
will produce correct English source code. This translation
needs to be done only once. Once the program is compiled,
the interface step is no longer needed.
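
As a rough illustration of what such a lexical translation step might look like, the sketch below replaces Arabic words with English PASCAL words and leaves everything else alone. It is only a sketch under stated assumptions: the Arabic spellings in the table are hypothetical pairings chosen for illustration (they are not the keyword table developed in this thesis), the text is assumed to be available as Unicode, and Python is used in place of the thesis's own tools.

```python
import re

# Hypothetical Arabic spellings for a few PASCAL reserved words and
# standard identifiers. These pairings are illustrative assumptions only.
WORDS = {
    "برنامج": "program",
    "ابدأ": "begin",
    "نهاية": "end",
    "اذا": "if",
    "اذن": "then",
    "والا": "else",
    "اكتب": "writeln",
}

# A token is either a quoted string literal (kept untouched) or a run of
# Arabic letters. All other text passes through unchanged.
TOKEN = re.compile(r"'[^']*'|[\u0621-\u064A]+")


def translate(source):
    """Replace known Arabic words with their English PASCAL equivalents."""
    def swap(match):
        text = match.group(0)
        if text.startswith("'"):
            return text  # never translate inside a string literal
        return WORDS.get(text, text)  # unknown words are left as they are
    return TOKEN.sub(swap, source)
```

A call such as `translate("ابدأ نهاية.")` would return `"begin end."`. Words that are not in the table, such as user-chosen identifiers, pass through untouched, so a real interface would still need a rule for mapping them to names the English compiler accepts.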
II.  BACKGROUND  ON  ARABIC  CHARACTER

A.  INTRODUCTION 

There are 28 basic characters in the Arabic alphabet
(Figure 1). However, these basic characters are not
sufficient for use with computers or typewriters.
Authorities agree [Ref. 1] that the optimum set should use a
minimum of 31 characters (Figure 2), three more than the
original set. The additional three characters are needed to
constitute the optimum set for representing Arabic texts.
One may check the Kufic script, which is over 1500 years
old, to see that engravings by ancient Arabs were done with
close to 31 characters, each having one shape. Over the
years, variations of the characters have developed for ease
of writing and reading. Each character may have from two to
five shapes depending on its location within a word. All
applications must use these variations as standards to
represent Arabic texts. Implementing the variations is
critical for compatibility. The code representation of any
variation must follow a strict standard to ensure survival
among other implementations.

The Arabic alphabet has only three vowels among the 28
characters (see Figure 3 for the alphabet names).
Vowelization is also performed through the use of diacritics
(see Figure 4). Most Arabic texts do not show diacritics.
Readers have learned to read and understand a word based on
the context of its use. If misinterpretation is critical,
clarifications are provided in parentheses. Most
applications today do not require diacritic symbols.

Both the Arabic numerals and the Hindu numerals are used in
the Arabic world. North African countries use the Arabic
numerals (as used in Latin). The name "Arabic" is given to
the numerals used in Latin, and the Hindu numerals are used
by most of the Arabic world (Figure 5). However, history
books show that both systems originated in India. The
Arabic language uses the Latin comma as the decimal mark so
that it can be distinguished from the Arabic zero, which
looks like the Latin decimal point.

B.  ARABIC  LANGUAGE 

The  Arabic  language  differs  from  languages  descended 
from  Latin  in  several  ways.  The  primary  differences  are: 

*  Arabic  is  written  right  to  left  instead  of  left  to 
right. 

*  Vowels are represented by diacritics, in the form of
marks over or under most letters within the words.

Secondary  differences  are: 

*  Letters  in  Arabic  may  be  joined  or  not  according  to 
location  within  the  word.  A  particular  letter  may  be 
joined  to  the  preceding  letter,  and/or  following  letter. 

*  Each  letter  has  between  two  and  five  different  forms 
dependent  on  its  contextual  position. 

Lexically the Arabic language can be defined in BNF
notation as follows [Ref. 1:p. 28]:

<language>   ::=  {<sentence>}*
<sentence>   ::=  {<word>}1*
<word>       ::=  {<character><voc.sym><character>}
<character>  ::=  { see Figure 1 }*
<voc.sym>    ::=  { see Figure 4 }*

C.  WRITING  ARABIC 

Writing in Arabic flows from right to left. Additional
lines also run from right to left, each beginning below the
previous line. A word is entered by typing the first
character at the cursor position, followed (to the left) by
the next character. As an example, take the word "hello."
If the same word were entered in the Arabic direction, it
would be entered as follows:

cursor position              -  _<
step 1. enter character "h"  -  _h<
step 2. enter character "e"  -  _eh<
step 3. enter character "l"  -  _leh<
step 4. enter character "l"  -  _lleh<
step 5. enter character "o"  -  _olleh<


This demonstrates the direction of flow. However, if one
must also worry about the shape of each character, typing
may seem tedious for long text. In some applications one
must provide diacritics as well. In short, typing one
vocalized word seems like a puzzle.

There are rules governing the shape (form) of the letter
based on its contextual position.
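
The keystroke sequence for "hello" can be reproduced with a few lines of code. This is a minimal sketch, assuming the line is held as a simple string, with "_" standing for the cursor and "<" for the right margin as in the steps shown earlier; the function name is hypothetical.

```python
def type_right_to_left(keys):
    """Return the visual line after each keystroke.

    Each new character is placed to the left of the characters already
    entered, so the first character typed stays at the right margin.
    """
    line = ""
    states = []
    for key in keys:
        line = key + line                # newest character goes on the left
        states.append("_" + line + "<")  # "_" is the cursor, "<" the margin
    return states


for step, state in enumerate(type_right_to_left("hello"), start=1):
    print(f"step {step}: {state}")
```

The sketch handles only the direction of flow. It does nothing about character shapes or diacritics, which are the harder parts of the problem.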


Dewachi, Abdulilah [Ref. 1:p. 27] has the following
opinion on the rules:

These  rules  have,  in  my  opinion,  been  developed  for  ease 
of  handwriting  and  have  no  bearing  on  the  semantic  and/or 
syntactic  requirement  of  the  language. 

Regardless of the reason for the development of the rules,
all books, newspapers, and magazines in the Arab countries
today are written using those rules. They will also stay
that way for years to come.

Arabic letters are cursive in shape. The implementation of
the alphabet is highly dependent on how legitimate the
characters look. The cursive nature of the characters
requires that both the monitor and the graphics adapter
provide good resolution. High resolution is also required
for supporting correct vocalization, as previously
discussed.


D.  ARABIC  NUMERALS 

Both the eastern Arabic numerals and the western Arabic
numerals (Figure 5) are used. Countries like Algeria,
Morocco, and Tunisia use the western Arabic numerals. The
numeral system is not a critical issue, since the two
representations have the same values.

Many people believe that the Arabs write the numbers from
left to right. This is a misconception. The language books
and schools teach the classical way of writing and reading
the numerals. The classical way is to use either the words
("one", "two", ...) or the numbers ("1", "2", ...) in
writing, starting from right to left. For example, the
number 523 will be written in Arabic as "three and twenty
and five hundred." It may sound wrong in English
composition, but this is the syntax that classical books
use. This method should be encouraged. It is also followed
in reading the numbers.

The most common method of handwriting numbers is to write
the digits in the order they are said. An example of how
numbers are read and written today is the year 1986,
pronounced as "one thousand nine hundred six and eighty."
Notice that the six comes before the eighty. Writing the
number "1986" using numerals is done as follows:


first  digit  1 _ 

second  digit  19 _ 

third  digit  19_6 

fourth  digit  1986 


This method is far too complicated to be adopted by
mechanical machines. The classical method should be
encouraged for another obvious reason. The numbers are
entered least significant digit first in low memory. From
the computer hardware point of view, the adders/subtractors
may begin work on the number before the complete number is
loaded [Ref. 1]. This is the more efficient way. Also,
both numbers and strings will be right justified.
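
The point about adders working before the whole number is loaded can be illustrated with a serial addition that consumes digits in the classical order. This is a minimal sketch, assuming decimal digits arrive least significant digit first; the function name is hypothetical and the code models the idea only, not any particular hardware.

```python
from itertools import zip_longest


def add_lsd_first(a_digits, b_digits):
    """Add two numbers given as digit sequences, least significant first.

    Each result digit is produced as soon as the matching pair of input
    digits has arrived, so the sum can start before either number is
    complete.
    """
    carry = 0
    for a, b in zip_longest(a_digits, b_digits, fillvalue=0):
        carry, digit = divmod(a + b + carry, 10)
        yield digit
    if carry:
        yield carry


# 523 entered as 3, 2, 5 and 1986 entered as 6, 8, 9, 1.
result = list(add_lsd_first([3, 2, 5], [6, 8, 9, 1]))
print(result)
```

The result comes out in the same least significant first order as the inputs. With most significant digit first entry, no result digit could be fixed until the last input digit was known, because a carry can still propagate upward.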

This chapter has outlined the major concerns and differences
between the Arabic and Latin alphabets. There are a few
more things worth noticing. The opening brackets "[", "{",
and "(" are the closing brackets in Arabic and vice versa.
The Arabic question mark has the same look as "?" but
rotated 180 degrees around its vertical axis. A complete
code set including special characters is given in the ARCII
code set (Appendix D). ARCII will be discussed in detail in
later chapters.


III.  CONTEXTUAL  PROBLEMS  IN  ARABIC  WORDING 


For any computer to work in Arabic it must also be able to
handle the English alphabet. Arabic users will pay a few
extra dollars to add the bilingual features when purchasing
a computer. The form of the bilingual feature is a
controversial issue. This chapter will show why one should
be concerned about using mixed mode or even alternating
between the two alphabets, Latin and Arabic.

There are three major differences between alphabets
descended from Arabic and those descended from Latin. The
differences are direction of flow, diacritics, and the
position-dependent shape of characters. These issues are
specific to the language. This chapter will discuss these
issues with respect to the computer environment.

Each difference requires special attention in a hardware
implementation of the Arabic alphabet. Handling the
direction of flow in reading and writing is very complicated
for users and developers alike. This is especially true
where the keyboard, the display, and the printer are to
operate in bilingual mode. Arabic is read and written in
the opposite direction to Latin. The difficulty arises when
the user wants to switch to the other mode for another
application, or wishes to mix both character sets within the
same application.
A boom in the introduction of electronic computing to the
Arabic world led manufacturers to take shortcuts in meeting
the complicated needs of the Arabic alphabet. Also, the
Arabic alphabet is used in several countries with non-Arabic
languages. This wide use invited companies to quickly
develop a character set for Arabic based on limited
research. As a result, important language needs such as
diacritics were neglected. This has also led to a delay in
the realization of an effective solution.

The contextual problem, that is, the variant shape of
characters, is the most difficult. Establishing a solution
means deciding the style or method that developers should
follow in implementing Arabic character sets. The problem
is the complexity of providing the user with all possible
shapes for the 28 character set on the keyboard. Each
character has between two and four shapes; at an average of
three shapes per character this makes a total requirement of
84 codes to represent the minimum set of the Arabic
alphabet. This number is about 60 percent higher than the
52 codes that the English alphabet (upper and lower case)
requires. The rest of the special characters and diacritics
require more codes. In some cases the application of a
diacritic to a character requires a unique shape to
represent it. This requires a unique code for the
combination of character and diacritic. The use of
"Hamzah"^1 also requires special attention when used with
any of the three vowels in the alphabet. The limited number
of codes the keyboard has is the limiting factor for
planning the code assignments. Some efforts and proposals
will be discussed in the following chapter.

^1 The "hamzah" is one of the three characters that were
added to the alphabet in addition to the original character

A.  DIRECTION  OF  FLOW 

Working in mixed mode is considered a must in the Arabic
environment. There are two approaches to handling the mixed
mode data entry and storage problem. One approach calls for
the data to be stored in aural order (i.e., logical order).
The second approach is to store the data in the same order
as it looks (i.e., visual order). Keep in mind that if an
Arabic word is inserted in English text, the last character
of the word will be encountered first when scanning from
left to right.

The first approach places the burden on the display to
translate the incoming data to the correct direction to be
displayed. The display must interpret an escape code or a
mode bit sent with the data. The easiest method is to use
the high bit (if it is not otherwise used) to indicate
whether the character is Arabic or Latin. This option calls
for smart display devices.
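
The high bit method can be sketched in a few lines. The sketch assumes a 7-bit base code in an 8-bit byte, with bit 7 free to carry the language mode; the constant and function names are hypothetical and do not describe any particular display's protocol.

```python
ARABIC_FLAG = 0x80  # assumption: bit 7 is unused by the 7-bit base code


def tag(code, arabic):
    """Attach the language mode to a 7-bit character code."""
    if not 0 <= code < 0x80:
        raise ValueError("base code must fit in 7 bits")
    return code | ARABIC_FLAG if arabic else code


def untag(byte):
    """Split a stored byte into (7-bit code, is_arabic)."""
    return byte & 0x7F, bool(byte & ARABIC_FLAG)


print(hex(tag(0x41, arabic=True)), untag(0xC1))
```

A smart display receiving such bytes can decide the writing direction one character at a time, without a separate escape sequence. The cost is that no more than 128 codes remain for each alphabet.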

The second approach is to store the data in visual order.
This approach places the burden on the computer to determine
how to store data so that the display direction never has to
shift. This means the display program will keep track of
the language mode and reverse the order as needed to store
the data in an appropriate order. In handwriting, handling
mixed modes is done in the following fashion:

-  continue  typing  until  reaching  a  foreign  character. 

-  count  the  number  of  spaces  occupied  by  foreign 
characters  up  to  the  first  native  character. 

-  skip  that  number  of  spaces  and  write  back  to  where  you 
stopped  before  skipping.  When  done  the  writer  should 
end  where  he/she  jumped  from. 

-  skip  the  same  number  of  spaces  you  counted.  This  is 
where  the  next  native  character  belongs. 

It seems that humans can do this routine more easily than
computers. The computer can only deal with incoming data as
it arrives, one character at a time. This means the
computer does not know in advance how many foreign
characters are coming. The computer can use a logical
device called a stack. Characters of the other mode are
stored (pushed on the stack) up to the next native
character. At this point the computer holds the foreign
string on the stack, where it will come off in reverse
order. In the next step the computer writes from the top of
the stack until no more characters are on the stack. Then
the program continues with the last encountered native
character. In this approach the direction of flow for the
display is maintained. Obviously this method has several
disadvantages. One, it slows the storing of data in mixed
mode. Two, it keeps the computer from doing other
functions, whereas a smart display could handle the display
of mixed mode data as they are stored logically.
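
The stack routine described above can be sketched as follows. This is a minimal sketch, assuming each incoming character arrives tagged with its mode and that the display runs in the native direction only; the function names are hypothetical, and upper-case Latin letters stand in for Arabic characters so the reversal is easy to see on any terminal.

```python
def to_visual_order(chars, native="latin"):
    """Convert (character, mode) pairs from typing order to display order.

    Characters of the non-native mode are pushed on a stack until the
    next native character arrives, then popped so that they come out
    reversed.
    """
    output = []
    stack = []
    for ch, mode in chars:
        if mode != native:
            stack.append(ch)            # hold foreign characters back
            continue
        while stack:                    # native character: flush the run
            output.append(stack.pop())
        output.append(ch)
    while stack:                        # flush a run that ends the input
        output.append(stack.pop())
    return "".join(output)


def demo(text):
    """Treat upper-case letters as 'arabic' and everything else as 'latin'."""
    return to_visual_order(
        (c, "arabic" if c.isupper() else "latin") for c in text
    )


print(demo("go to CAIRO now"))
```

The sketch treats spaces as native characters, so each foreign word is reversed on its own. A run of several foreign words would also need its word order reversed, which means spaces inside such a run would have to be pushed on the stack as well.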

The  approach  that  should  be  taken  is  connected  with 
resolving  the  contextual  issue,  the  variant  character  shape 
problem. 

B.  ARE  DIACRITICS  REQUIRED? 

By linguistic standards the omission of diacritics by
computers murders the Arabic language. Linguists have
always officially criticized the mispronunciation of
statements by television and radio people. The use of
diacritics is a must in the language, even by the
recommendation of westerners involved with the Arabic
alphabet [Ref. 1:pp. 39-46].

In a previous chapter diacritics were discussed. There are
five basic diacritics. The five are (Figure 3), from right
to left: "Fat_ha", "Dammah", "Kasrah", "Sukoon", and
"Shaddah". The first three can be doubled, in the same
manner as double quotes in Latin. When any of these three
is doubled it is known as "Tanween" and adds an N sound to
the character. The Shaddah has the same effect as doubling
the consonant in English. It can be used in conjunction
with any of the first three or their "Tanween." The Sukoon,
when used, means that the character must be read in its
primitive form, that is, without any of the previous
diacritics.

An example of one word using different diacritics will show
how the sound and subsequently the meaning changes. The
word pronounced "tilmeeth" in Arabic means a student. The
"th" at the end is the character "Thal" in Arabic. The
example shows the different sounds of the word when only the
last character carries different diacritics.

WORD        VOWELIZATION    PRONOUNCED
TILMEETH    "FAT_HA"        TILMEETHA
TILMEETH    "KASRAH"        TILMEETHI
TILMEETH    "DAMMAH"        TILMEETHO
TILMEETH    "SUKOON"        TILMEETH

Using the "Tanween" effect with the first three diacritics,
the same word is pronounced as follows:

with "Fat_ha tanween"    TILMEETHAN
with "Kasrah tanween"    TILMEETHIN
with "Dammah tanween"    TILMEETHON

Shaddah can be used with all of the above except the Sukoon.

The use of diacritics removes the ambiguity in the reading
of text. It is powerful enough to change the meaning of a
sentence completely. The vowelization of verbs by
diacritics will change the sentence to the passive voice.
In Arabic the verb comes before the noun. So in Arabic the
two statements 'was stolen Ali a book' and 'stole Ali a
book' could not be distinguished without the use of
diacritics, especially on the verb. The effect of the "er"
and "ee" in English, as in "employer/employee," is also
achieved by the use of diacritics on the noun in Arabic. In
combination, the failure to use diacritics can completely
obscure the meaning of a sentence. For example, it would be
as if in the sequence "fired/was_fired employee/er" we did
not know which of each alternative is meant. The employee
either was fired or fired someone. On the other hand, the
employer either was fired or fired someone. See Figure 6
for some examples with and without vowels.
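
The "stole" and "was stolen" ambiguity can be shown directly on character codes. The sketch below is an illustration under an assumption the thesis does not make: it uses Unicode code points (diacritics at U+064B through U+0652) rather than any of the code sets discussed here, simply because they are readily available. The function name is hypothetical.

```python
# Fat_ha, Dammah, Kasrah, their three tanween forms, Shaddah, and Sukoon
# occupy the Unicode range U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Remove all vocalization marks, leaving only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


# Base letters: seen (U+0633), ra (U+0631), qaf (U+0642).
active = "\u0633\u064E\u0631\u064E\u0642\u064E"   # saraqa, "stole"
passive = "\u0633\u064F\u0631\u0650\u0642\u064E"  # suriqa, "was stolen"

print(active != passive)
print(strip_diacritics(active) == strip_diacritics(passive))
```

The two vocalized verbs differ only in their marks. Once the marks are dropped, both reduce to the same three letters, and only the context can tell a reader or a program which one was meant.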

Clearly one can see the need for diacritics. In religious
and history texts, they are used extensively. At an
international symposium on the standardization of character
code sets and keyboards for the Arabic language in
computers, held on 1-4 June 1980, several proposals were
presented by researchers and by companies that had already
developed their own character sets [Refs. 1,2]. All the
proposals and recommendations agreed on including the
diacritics. The use of diacritics will be beneficial in
data bases, artificial intelligence, and educational
textbooks.


C.  THE  CONTEXTUAL  ISSUES 

The  mere  presence  of  a  character  in  different  locations 
within  a  word  determines  the  shape  to  be  written  or  read. 
Should  the  computer  do  the  analysis  and  free  the  user  from 
worrying  about  a  large  complex  character  set,  or  should  the 
keyboard  contain  all  possible  variations  of  each  character 
and  have  the  user  learn  to  master  more  than  one  hundred 
strokes  for  the  alphabet  in  addition  to  numerals,  special 
characters,  and  punctuation? 


One  popular  approach  is  to  provide  only  a  minimum  set  of 
required  characters,  usually  between  31  and  60  not  including 
diacritics,  numerals,  and  special  characters.  This  approach 
is  known  as  the  single  character  single  shape  keyboard.  The 
data  is  stored  in  memory  or  storage  devices  using  this 
reduced  code.  The  reduced  code  is  analyzed  by  an  interface 
to  give  the  right  form  or  shape.  The  interface  is  part  of 
the  display,  when  smart  displays  are  used,  or  a  shell  on  top 
of  the  "O.S."  to  contextually  analyze  the  character  form. 
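
The contextual analysis that such an interface performs can be sketched as follows. This is a simplified sketch under stated assumptions: it covers only the 28 basic letters, identifies them by Unicode code point (not by any code set discussed in this thesis), ignores diacritics and the hamzah forms, and returns a label for each letter instead of a display code. All names are hypothetical.

```python
# The 28 basic letters by Unicode code point (an assumption of this sketch).
LETTERS = (
    {"\u0627", "\u0628", "\u064A"}
    | {chr(cp) for cp in range(0x062A, 0x063B)}
    | {chr(cp) for cp in range(0x0641, 0x0649)}
)

# Alef, dal, thal, ra, zay, and waw join only to the preceding letter.
RIGHT_ONLY = {"\u0627", "\u062F", "\u0630", "\u0631", "\u0632", "\u0648"}


def contextual_forms(word):
    """Label each letter as isolated, initial, medial, or final."""
    labels = []
    for i, ch in enumerate(word):
        if ch not in LETTERS:
            labels.append("other")
            continue
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        joins_prev = prev in LETTERS and prev not in RIGHT_ONLY
        joins_next = nxt in LETTERS and ch not in RIGHT_ONLY
        if joins_prev and joins_next:
            labels.append("medial")
        elif joins_prev:
            labels.append("final")
        elif joins_next:
            labels.append("initial")
        else:
            labels.append("isolated")
    return labels


# "tilmeeth": ta, lam, meem, ya, thal
print(contextual_forms("\u062A\u0644\u0645\u064A\u0630"))
```

With the form known, the interface only needs a table from (stored code, form) to the display code or font cell. The user types and stores one code per letter and never selects a shape by hand.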

The issue is not quite settled and standardized among all
Arabic alphabet users, nor among Arabic countries. To my
knowledge, authorized representatives from all concerned
countries have not yet met and agreed on a standard. A few
companies that stepped into the market early have generated
their own versions of character code sets. Some companies
have realized the gap between their early implementation and
actual language needs. The gap became more apparent when
the product was not used in all the areas and aspects for
which it was designed. Some companies have realized that
the survival and popularity of their product depends on
compatibility with at least the codes of a character's
internal representation. Some companies went further by
investing in research for an optimum solution. Language
experts were hired and/or consulted by companies like IBM,
TI, and WANG. These companies are following efforts toward
solutions and continuing the research to achieve an
effective solution.

In resolving the multiple character shapes, most companies
have tried some reduction of all possible codes to a single
code using several philosophies. Texas Instruments has
presented [Ref. 1] three approaches to reduce the Arabic
code.

The first approach was called "CORRESPONDENCE &
DIFFERENCES." This approach divided the alphabet into
groups. The first, type A, has characters with one, two, or
three points (Appendix B). The second, type B, has
characters without points. The last, type C, contains
characters whose basic form occurs both with and without
points, for example the characters "RA" and "ZA." The two
characters have the same form, with a point on the "ZA" and
no point on the "RA." The idea is that the basic form has
one key (code); two or more characters will share the same
basic form, and the points can be added later.

The second approach was called "ROOTS & APPENDICES"
(Appendix B). The approach divided the alphabet into
groups. Two groups have six characters each. Another group
has four characters. The characters in each of the above
groups share the same cursive "APPENDICES." The "ROOT" of
the character can be used at the beginning or in the middle
of a word. One appendix will complement each root of a
group. This will require a total of seven codes for a group
of six roots. The group would otherwise require (for six
characters, each with three contextual forms) a total of 18
separate keys and/or codes. This approach implicitly asks
for more software to analyze the appendices. A character
may be represented by two codes internally. This will make
text storage inefficient.

The last approach was "CONTEXTUAL ANALYSIS" (Appendix B).
Texas Instruments has developed a product using this
approach. The DS990 Bilingual System can handle
Arabic/Latin modes and display them on a screen or line
printer. The contextual analysis approach, in all the
developments seen by the author, uses a reduced code set.
The reduced code set is used for the internal representation
of data. Keyboard keys of the Arabic set are kept to a
minimum, usually the basic form. A software interface
analyzes the character contextually and displays the
characters in the right form. In some applications this
interface software is moved away from the CPU and made the
responsibility of the display terminals. Such terminals are
called 'SMART' terminals. TI's DS990 system diagram
(Appendix C) shows the configuration of a typical system.

TI realized the need for diacritics in the Arabic language
after it introduced the system to the marketplace. At an
international symposium held in Riyadh, Saudi Arabia, on 1-4
June 1980 [Ref. 1:p. 68], in an effort at standardization of
codes, character sets, and keyboards, TI recommended that
the Arabic computer systems standards requirement include
the use of diacritics. This is an example of the approach
of the pioneer companies who had to define and develop the
alphabet code sets. Premature standards will automatically
be overridden by the authorized agency. The DS990 did not
handle the use of diacritics. Since the use of diacritics
was adopted by all standards committees, a few companies
were led to follow a new standard that supports diacritics.

ALIS, Inc., introduced BCON™ as a bilingual operating
system. BCON was geared toward MS-DOS based microcomputers.
The bilingual operating system is an interface between the
operating system (O.S.) and different applications [Ref. 2].
This bilingual operating system adopted the single key or
single code approach. Each character is represented
internally in memory by a unique code. BCON also fully supports
the use of diacritics in text. The single code approach, as
mentioned before, requires that a device or an interface
(hardware or software) properly analyze the character and
display the correct form. BCON uses Application Screen
Image Compensations (ASIC) to perform the contextual
analysis. BCON uses separate codes and fonts for each character.
The internal character code gets translated (mapped) to its
output code. The internal code has 4 to 5 output codes.
The code to be displayed is based on the location of the
character within the word (TI's and BCON's systems will be
covered in more detail in the next chapter).


Several nations use the Arabic alphabet today, both
Arabic speaking nations and non-Arabic speaking. It is a
political challenge to gather concerned nations and succeed
in establishing a standardized code set acceptable to all of
them. It is difficult for any one country to take the
initiative and responsibility to follow such a program until it
comes to life. It is hard for a single country to conduct
research and share knowledge with another country that is
thousands of miles away. In recent years, as cooperation
between Arab nations has increased, and as methods of
communication have improved, as well as travel, there have been
more productive meetings and symposiums. Several countries
have mutually cooperated to work and develop a possible
solution to the standard code set for Arabic in data
processing.

Many  countries  like  Kuwait,  Iraq,  Morocco,  and  Saudi 
Arabia  have  hosted  meetings  and  symposiums,  listening  to 
experts  on  the  language,  and  in  the  data  processing  field. 
Researchers,  as  well  as  company  representatives,  have 
brought  up  points  to  consider,  shared  their  experiences,  and 
given  recommendations.  Several  existing  systems  have  been 
developed  or  proposed  by  companies  or  individuals  in  the 
field. The countries that have been exposed to technology
and are more developed than other Arab nations have an
urgent need to set standards in general. Countries like
Morocco started as early as the 1950s to set standards for
printing devices.

The North African countries have progressed further in
this research. Morocco willingly shared its latest research
and developments in the area with the other Arab nations.
The choice between adopting an existing system, with some or
no modification, and defining a new standard once again is
also a political issue.

A.  SOLUTION  EFFORTS 

Several  companies  have  provided  results  of  their 
research  and  in  some  cases  have  implemented  systems,  giving 
recommendations  and  results  of  conducted  tests,  in  the  case 
of  keyboard  layout  proposals.  Companies  that  have  an 
interest  in  the  market  and  have  worked  in  the  Arabic  data 
processing  field,  have  no  authority  to  develop  a  code  set 
standard.  Government  representatives  are  the  authorized 
agency  to  do  such  a  task.  Several  companies  have  proceeded, 
given  a  lack  of  standards,  to  develop  Arabic  code  sets  and 
implement  them  on  hardware.  This  has  resulted  in  several 
incompatible  systems  of  code  sets.  Data  in  one  system  means 
different things in another code set system. This approach
to the development of code sets has both disadvantages and
some advantages to the companies involved.

Early development made companies as well as users
understand the weaknesses of the developed system. For
example, the omission of diacritics in TI's DS990 system
failed to fulfill the needs of the language. On the other
hand, by just introducing a product early, companies make
their name familiar to customers. The customer cannot
complain about a reasonable attempt. This did establish a
good reputation for such companies, especially when they
adopt the approved standard and reintroduce their products.
In addition to developing a good name, they gain experience
in the process. This will help them introduce a product
complying with the standards sooner. So a company's early
efforts are not a total waste.

Since early implementations ignored the use of diacritics
with text, newer designs have to pay special attention
to their use. Data base machines must pay attention when
sorting and searching. The representation of diacritics
will require special care from data processing machines.
The priority of characters with or without diacritics must
be known to the machine. Stripping the diacritics from a
stored string before matching it against a query
will facilitate the search. However, the target of the
search, when found, must be displayed, and stored if
updated, in the vocalized form.

Unlike Texas Instruments, IBM chose to maintain domination
in the market for typewriters and Arabic-only EDP machines.
IBM did conduct studies on their own in an effort to develop
a code set and keyboard layout. IBM, represented by
Mr. R.P. Hajjar and Dr. A.M. Ismail, presented their attitude
toward a bilingual code set standard at the symposium held in
Riyadh, Saudi Arabia, in June 1980 [Ref. 1:p. 72]:

Meanwhile,  competent  people  from  the  Arab  world  and  from 
elsewhere,  have  addressed  the  same  subject  and  came  up 
with  a  variety  of  solutions  that  are  not  compatible  with 
each  other,  due  to  the  fact  that  they  reflect  the  require¬ 
ments  of  a  particular  Arab  country,  but  may  not  be  totally 
acceptable  by  the  neighboring  Arab  country.  This  is  the 
main  reason  why  IBM  has  not  implemented  such  solutions, 
but  will  look  forward  to  investigate  the  possibilities  of 
their  implementation,  in  case  these  solutions  are  adopted 
as  part  of  an  inter-Arab  standard. 

IBM, TI, and Wang have shared their research and willingness
to achieve a solution and adopt it in their products.

This  chapter  will  briefly  cover  three  systems: 

-  TI  DS990  System 

-  ALIS  Inc.,  BCON  System

-  ASV-CODAR  Proposed  System.

B.  TI  DS990  BILINGUAL  SYSTEM 

DS990 is a bilingual system that generates a seven-bit code
for ASCII characters and an 8-bit code for Arabic
characters. The system represents the Arabic alphabet with 32
unique codes in addition to 13 special characters. The
thirty-two codes are the internal representations of the
alphabet. TI's system uses the one key many shapes
philosophy. The 32 codes are the basic character set of the
system (Appendix C). The one key many shapes approach
requires the use of an interface with a smart display to
display the correct form and shape. The DS990 block diagram
(Appendix C) shows how the system is arranged. The 32
codes are mapped to 128 less 13, giving a total of 115 shapes
that can be displayed. The display ROM interface contains
all 128 shapes (Appendix C). The display service routine
(DSR) and the display ROM interface contextually analyze the
basic code set and display the data correctly by mapping one
code to one or two display code(s).

DS990 does not handle diacritics. It also increased the
optimum set from 31 to 32 unique characters, because the system
considers LAMALEF as a single character. These are two clear
violations. The use of diacritics is a must in data processing.
The LAMALEF (DC hex value in the basic character set)
(Appendix C) is composed of the character LAM (D6 hex)
followed by the character ALEF (C0 hex), which are two
separate characters and should not have a unique code. The
fact that the table shows no special code for eastern Hindu
numerals indicates that the same code for Arabic numerals,
known as western Hindu, is used for both representations
(Figure 5). Depending on the display mode, the eastern
(Hindu) and the Arabic (western Hindu) numerals are displayed
differently. So a user in a North African country cannot
use the western Hindu numerals (known as Arabic numerals) in
Arabic mode. This is not desirable.


DS990 stores information in memory in logical order in
Latin mode and Arabic mode. The display ROM interface and
the control program map the internal representation of one
code to one or two display codes. For example, to display
the character 'SEEN' as in the basic character set (CB hex
value) (Appendix H), the character is represented by two
display codes. The first code is the value BC hex followed
by the code 8B hex in the display ROM interface table.
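
The one-to-one-or-two mapping described above can be sketched as a table lookup. The following Python fragment is only an illustration, not TI's display service routine: the table name and function name are hypothetical, and the only entry taken from the text is SEEN (CB hex mapped to BC hex followed by 8B hex). Codes missing from this toy table are passed through unchanged, which is an assumption made for the sake of the example.

```python
# Hypothetical display table: internal code -> one or two display codes.
# Only the SEEN entry (CB hex -> BC hex, 8B hex) comes from the text.
DISPLAY_TABLE = {
    0xCB: (0xBC, 0x8B),  # SEEN is drawn with two display codes
}


def to_display_codes(internal_codes, table=DISPLAY_TABLE):
    """Expand each internal code into its one or two display codes.

    Codes that are not in the table pass through unchanged; this is an
    assumption of the sketch, not documented DS990 behavior.
    """
    display = []
    for code in internal_codes:
        display.extend(table.get(code, (code,)))
    return display


if __name__ == "__main__":
    # One internal SEEN code followed by an ASCII blank.
    print([hex(c) for c in to_display_codes([0xCB, 0x20])])
```

The point of the sketch is that the stored text keeps one code per character, while the display stream may be longer than the stored stream.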

The approach followed by TI is the typical way most
companies are implementing their display techniques.
However, the disadvantage is the omission of diacritics and
considering "LAMALEF" as one character. TI has indicated
they now believe the implementation must have diacritics
[Ref. 1].

C.  ALIS  INC.,  BCON  SYSTEM 

ALIS Inc. introduced BCON™ as a bilingual operating
system that could be a standard to follow, or at least close
to a standard. The bilingual operating system adopted the
single key single code approach. Each character is
represented by a unique code internally in memory. BCON also
fully supports the use of diacritics in text. BCON was geared
toward MS-DOS based microcomputers. The bilingual operating
system is an interface between the MS-DOS operating
system and applications. BCON is designed to facilitate the
adaptation of the large number of existing MS-DOS
applications to Arabic [Ref. 2]. The single code approach,
as mentioned before, requires that some device or interface
(hardware or software) properly analyze the character and
display the correct form. BCON uses Application Screen
Image Compensations (ASIC) to perform the contextual
analysis, and then selects the correct display code
(Appendix D).

1. Hardware and Software of BCON

BCON hardware is another board on top of the Latin
character generator board. The new board has the Arabic
character generator with the required wiring to allow
concurrent operation of both character generators. The two
boards are back to back and use one slot on the motherboard
of the microcomputer. Keyboard caps (or stickers) are
provided for use on the keyboard. The stickers have both
alphabets printed side by side.

The software is a program which, when activated,
resides in low memory and uses 19k bytes. Once BCON is
activated, it can be set in Latin "native" mode or Arabic
mode. The only way to free memory is to reset the system.
Both modes of the operating system will allow bilingual
insertion in the appropriate direction. In their early
version (up to early 1985), ALIS introduced a reduced code
called Arabic Reduced Code for Information Interchange (ARCII).
ARCII is the internal representation of the characters in
memory and what is seen by the operating system.


2. Arabic Reduced Code for Information Interchange (ARCII)

ARCII is ALIS's early attempt to define a code set. The
reduced code (ARCII) (Appendix D) is the internal
representation of data in memory. The ALIS reduced code is
completely different from early proposals for a target
standard set proposed by ASMO (further details will be
covered in the next section).

The code uses the graphic characters for the Arabic
set. By assigning one to the 8th bit, 128 additional codes
are available for Arabic codes. This allows the BCON
bilingual system to mix codes and use both ASCII and ARCII.
There are 46 different codes assigned for the alphabet,
starting with code D0 hex and ending with FD hex. ARCII
places the diacritics early in the table to give them
priority in sorting algorithms. This early positioning in the
table was not favorable, however. The reasoning will be
discussed when the standard code and the format
justification are discussed. The escape codes and special characters
should not be redefined for ARCII if similar ones in Latin
exist. This minimizes the code set for ARCII, freeing more
codes for future expansion. The number of function codes
could likewise be minimized by using the international ones.
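
The use of the 8th bit to separate the two code sets can be illustrated with a short sketch. The Python fragment below is not ALIS code. It only assumes what the text states: ASCII occupies the codes with the 8th bit clear, ARCII uses codes with the 8th bit set, and the alphabet runs from D0 hex to FD hex. The function names are hypothetical.

```python
ALPHABET_FIRST = 0xD0  # first alphabet code, as stated in the text
ALPHABET_LAST = 0xFD   # last alphabet code, as stated in the text


def is_arcii(code):
    """True when the 8th bit is set, i.e. the code lies in 80..FF hex."""
    return (code & 0x80) != 0


def is_arcii_letter(code):
    """True when the code falls inside the alphabet range D0..FD hex."""
    return ALPHABET_FIRST <= code <= ALPHABET_LAST


def split_runs(codes):
    """Group a mixed code sequence into (is_arabic, [codes]) runs."""
    runs = []
    for code in codes:
        arabic = is_arcii(code)
        if runs and runs[-1][0] == arabic:
            runs[-1][1].append(code)
        else:
            runs.append((arabic, [code]))
    return runs


# The range D0..FD hex holds exactly the 46 alphabet codes cited above.
ALPHABET_SIZE = ALPHABET_LAST - ALPHABET_FIRST + 1
```

Because a single bit test classifies every code, mixed Latin and Arabic text can be held in one byte stream without any mode-switching codes in storage.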

ALIS reduced code is completely different from early
proposals for a target standard set. The Arab Organization
for Standardization and Metrology (ASMO), after several
years of research and after meeting with Arab
representatives, recommended the use of CODAR U-F.D. as a standard for
Arabic codes (further details will be covered in the next
section). Subsequently, ALIS and other companies adopted
the new code set in order to assure compatibility with other
applications and implementations. BCON's original version
of reduced code (ARCII) (Appendix D) is the internal
representation of information in memory.

The form or appearance of characters, that is, how they
should be displayed, is not a major issue. This is dependent
on the machine resolution and capabilities. The fonts and
style of displayed texts vary from one machine to another.
ASMO has recommended that the style of displayed text be
left to developers. This has left a lot of room for
manufacturers to be creative and compete for quality work for
the benefit of the user.

3. Operating Principles of BCON

BCON, once loaded, resides in memory using 19k of
low memory. BCON has three code sets. The three code sets
are: reduced code (ARCII), key code and display code.
Figure 7 shows how the three codes are integrated with each
other. A list of the three code sets is provided in
Appendix D. ARCII includes the diacritics as a part of the code
set. This was set as a requirement of the CODAR U-F.D.
standards. BCON receives the key code and stores it in
memory in reduced code form. The reduced code form is
contextually analyzed by BCON and displayed in
the correct form. In the display process, BCON appends, if
necessary, what is called "TAIL GENERATION" to some
characters if they fall at the end of a word [Ref. 2].
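
A minimal sketch of the kind of contextual analysis described above is given below in Python. It is not BCON's ASIC routine and it uses no BCON codes: letters are given by symbolic names, diacritics and tail generation are ignored, and the only linguistic fact relied upon is that a small group of Arabic letters never joins to the letter that follows. The names NON_JOINING_AFTER and contextual_forms are hypothetical.

```python
# Letters that never connect to the following letter. Names are symbolic
# stand-ins, not ARCII or display codes.
NON_JOINING_AFTER = {"ALEF", "DAL", "THAL", "REH", "ZAIN", "WAW"}


def contextual_forms(word):
    """Return (letter, form) pairs for a word given as a list of names.

    The form depends only on whether the letter joins its neighbors:
    both sides -> medial, previous only -> final, next only -> initial,
    neither -> isolated.
    """
    forms = []
    last = len(word) - 1
    for i, letter in enumerate(word):
        joins_prev = i > 0 and word[i - 1] not in NON_JOINING_AFTER
        joins_next = i < last and letter not in NON_JOINING_AFTER
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((letter, form))
    return forms


if __name__ == "__main__":
    # BEH ALEF BEH: the ALEF breaks the connection to the last BEH.
    print(contextual_forms(["BEH", "ALEF", "BEH"]))
```

In a real system each (letter, form) pair would then be looked up in the display code table, which is where the "4 to 5 output codes" per internal code come from.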

The early work on BCON, as well as the work of other
companies, must be modified to correspond to the new
standards. ALIS in early 1986 introduced a new mode in
addition to ARCII. The new mode uses the ASMO approved code
set. No documents are available at this time. However, as
mentioned before, previous effort was not totally lost. The
company still utilizes the contextual analysis developed
earlier, with minor modifications. The same is true for
their printer driver software. This is a good example of
how early development enables a company to react quickly to
new demands.

D.  ASV  CODAR-U  SYSTEM 

In researching the early efforts initiated by official
organizations or government agencies for inter-Arab
unification of the code set, two names were always associated:
CODAR and Dr. Lakhdar. A few acronyms are important here:

CODAR : Code Arabe (French)

ASV   : Arabe Standard Voyellé (French)

IERA  : Institut d'Etudes et de Recherches pour
        l'Arabisation

IBI   : Intergovernmental Bureau for Informatics

COARIN: IBI Committee on the use of Arabic in Informatics

ALESCO: Arab League Education Cultural and Science
        Organization

SASO  : Saudi Arabian Standards Organization

ASMO  : Arab Organization for Standardization and Metrology

Dr.  Ahmed  Lakhdar  Gazal,  Director  of  IERA  (Institute  for 
Research  and  Studies  for  Arabization  in  Rabat,  Morocco)  has 
been  associated  with  the  CODAR  project  for  several  years. 
Dr.  Lakhdar  proposed  that  the  Arab  nations  adopt  the  CODAR 
system  as  a  standard  for  telecommunications.  IERA  was 
working as far back as 1955. A standardized Arabic code
was something many people had expected and needed for many
years. However, IERA has no power to define such a code or
make it official, even assuming it is acceptable.

The CODAR system is a long-running project that is geared
toward setting standards for several fields of interest. The
project covers:

- PRINTING

  - TYPEFACES

  - TRANSFER LETTERS, SELF-ADHESIVE TYPES

  - SLUG-CASTING MACHINES

  - MOVABLE TYPE COMPOSITIONS-CASTER

  - PHOTOCOMPOSITION

- TYPEWRITERS

- INFORMATICS AND DATA TRANSMISSION

- TELECOMMUNICATIONS


This chapter is concerned with Informatics and Data
Transmission. However, a lot of credit must be given to the
personnel behind CODAR. It took a lot of effort and
dedication by IERA's staff to accomplish a unification. A
long list of acknowledgments, appreciation, and financial
support letters was collected by CODAR from several
countries and organizations. The participants include:

- Moroccan Ministry of Education (1956)

- First Conference of the Arab National Commissions for
  UNESCO (1958)

- First Conference on Arabization (Rabat, 1961)

- UNESCO (Arab book-keeping experts meeting) (Cairo, 1972)

A long list of occasions and dates is given in [Ref. 1:pp.
207-210].

Under Informatics and Data Transmission there were three
versions of the 7-bit code system. They are:

Seven bit CODAR I : first coding scheme of the ASV
                    characters

Seven bit CODAR II: a proposition for a unified Arabic
                    coding scheme, discussed at a regional
                    (IBI) meeting at Bizerte, Tunisia,
                    June 1976

Seven bit CODAR U : unified coding scheme for the Arab
                    countries proposed by COARIN (IBI
                    committee on the use of Arabic in
                    informatics) at a meeting in Rome,
                    June 1977.

The seven bit CODAR I, CODAR II, and CODAR U (Appendix
E) are code set proposals. CODAR I was produced by EURAB
and the printers were manufactured by the Italian firm SELI.
CODAR II is a subsystem of CODAR I. The subsystem can be
obtained by removing all possible combinations of "Harakat"
(i.e., Fat'ha, Kassrah, and Dammah) with the "Shaddah." The
subsystem also leaves out three Persian characters, opening
and closing square brackets, backslash and a few character
variant shapes.

CODAR U fully supports vocalization with all possible
"Shaddah" combinations with the "Harakat." This system is
the closest to being acceptable to ASMO and approved as a
standard. ASMO's approval will give the system official
status.

E.  THE  STANDARDIZED  SET 

In 1980 CODAR U was accepted as a working basis for a
basic code set. Recommendations and modifications were to
be presented to ASMO in order to formalize the code set.
The next step was to distribute it to ASMO's members.
Member countries ensure that it is implemented accurately.

During a meeting held on 22-24 April 1982 in Rabat
(Morocco), the final code for the proposed standard, called
CODAR U-F.D., was finalized and submitted to ASMO along with
six recommendations (Appendix F). The conference
recommended that ASMO distribute the code and have it tested by
IERA, SASO, and the National Center for Information in
Tunisia before enforcing it. ALESCO and ASMO were also
urged to make every effort for the adoption of the code by
all Arab countries.

Finally, on October 21, 1982 ASMO adopted the code
prepared by IERA and ALESCO. This code was the result of the
CODAR U-F.D. proposed in April, 1982 at Rabat. The
modifications and changes are included (Appendix G). There
are a few points to consider. There are 31 codes for the
alphabet, 3 codes for "Harakat," 2 codes for "Shaddah" and
"Sukoon," 5 codes for "Hamzah," and 3 codes for "Tanween,"
totalling 44 codes. Their location must not be changed in
the table under any circumstances. The "Hamzah" in all
variations, on top of or under characters, is considered a
form of "Hamzah." The "Hamzah" is placed at the beginning of
the code table, which in searching means any character with
"Hamzah" associated with it should be expected higher in
order (equivalent to "A" in Latin). This concept will
confuse users when searching or sorting. The results may be
surprising for sorting algorithms. In principle the table
allows a simple sort, but errors will result from the
occurrence of diacritics and the code 60 hex in the table (6/0).
The code 60 hex is used for connection or extending a word
for formatting purposes. So a sorting algorithm should
strip text of the diacritics and the connection dash
(similar to Latin underscore) first, then sort the text
according to the basic 31 character code. The user must be
educated about all the remarks mentioned in the reasoning in
ASMO's final form of the code set. Another convention was
that the character comes first in words that are vocalized.
The form to follow is:


                                                       *
    WORD ::= { <CHARACTER> <SHADDAH> <DIACRITICS> }
                                                       1


So the "Shaddah" comes before the diacritics if used for a
character. The second convention is that if the pure
(unvocalized) words match in sorting, the diacritics should
then be used by the sorting algorithm as qualifiers. In my
opinion, this violates the Regularity Principle in
programming, since the user must be concerned with and
remember all the exceptions. This does not in any way mean
there is an easier way.
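
The two sorting conventions can be sketched as a two-part sort key. The Python fragment below is an illustration under stated assumptions, not part of the ASMO standard: the only code value taken from the text is the connection code 60 hex, while the set of diacritic codes is passed in by the caller because the actual values are listed in Appendix G and are not reproduced here. The function names and the sample code values are hypothetical.

```python
CONNECTOR = 0x60  # connection (word-extension) code, from the text


def sort_key(word, diacritics):
    """Build a (primary, secondary) key for one word.

    primary   : the word stripped of diacritics and the connection code
    secondary : the word with diacritics kept, used only to break ties
    """
    primary = tuple(c for c in word
                    if c not in diacritics and c != CONNECTOR)
    secondary = tuple(c for c in word if c != CONNECTOR)
    return (primary, secondary)


def sort_words(words, diacritics):
    """Sort words by base letters first, diacritics as qualifiers."""
    return sorted(words, key=lambda w: sort_key(w, diacritics))


if __name__ == "__main__":
    # Hypothetical code values, chosen only to exercise the function.
    marks = {0x70, 0x71, 0x72}
    sample = [
        [0x42, 0x70, 0x43],  # vocalized form of 42 43
        [0x41, 0x60, 0x44],  # 41 44 stretched with the connector
        [0x42, 0x43],        # unvocalized 42 43
    ]
    for word in sort_words(sample, marks):
        print([hex(c) for c in word])
```

Keeping the stripped form as the primary key means the vocalized and unvocalized spellings of a word sort next to each other, and the stored text itself is never altered.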

F.  CONCLUSION 

The  ASMO  code  set  is  the  standard  Arabic  code  set  the 
Arab  countries  must  enforce  in  their  countries. 
Subsequently  all  companies  in  the  area  must  adopt  and  use  a 
standard  code  set.  The  competition  is  now  directed  toward 
improving  the  display  application  with  high  resolution  and 
graphic  capabilities.  Printing  devices  also  are  an  area  for 
manufacturers  to  compete  in  printing  different  Arabic  styles 
and fonts. The contextual issue is left as a flexible issue
to the implementors to research and develop for their
individual products. The display form of text on monitors and
printing  devices  will  not  affect  the  internal  representation 
of  the  data,  which  must  be  compatible  with  the  standard  code 
set.  This  may  result  in  several  display  sets  developed  by 
the  companies  as  their  view  and  intention  of  displaying  a 
good  Arabic  text.  Hopefully  this  should  create  a  stable 
base  to  work  with  and  encourage  development  of  products 
based  on  the  ASMO  standards  and  conventions  listed  in 
Appendix  G. 


V.  INTERFACE  DESIGN  GENERAL  APPROACH 


The lexical translator will generate Latin code from an
Arabic source code in Pascal syntax. The Pascal compiler
can compile and run the Latin code to generate an output. The
interface  will  generate  a  correct  Latin  code  given  that  the 
Arabic  source  code  is  in  correct  syntax.  The  translator 
will  give  minimum  help  to  correct  the  Arabic  code.  The  user 
must  understand  the  syntax  and  the  semantics  of  the  language 
to  write  correct  source  code.  The  interface  is  not  an 
interactive  type  of  translator.  The  design  is  generally  the 
same  for  all  Pascal  compilers.  The  interface  must  always 
consider  the  environment  it  will  work  in.  The  interface  has 
two  environments  to  consider:  the  source  code  bilingual 
system,  and  the  compiler  environment.  From  the  portability 
and  compatibility  point  of  view,  the  translator  will  be 
limited  to  a  particular  Arabic  standard,  and  a  particular 
PASCAL  implementation. 

The bilingual implementation has its own function codes.
Those codes are embedded within the Arabic source code, if
generated under the bilingual operating system. The
bilingual operating system used here is BCON from ALIS, Inc.
There is a list of function codes in Appendix D. The PASCAL
compiler used here is TURBO PASCAL from Borland, Inc.

The Arabic implementation utilizes the upper half of the
255 character set used by graphics to display Arabic fonts.
Some Pascal compilers will accept any of the 255 characters
as legal characters for use in string data. Turbo Pascal,
for example, allows the entire set of 255 characters. This
is one reason why Turbo Pascal is used in this thesis as a
target environment for the generated code. The interface
will, however, generate a correct PASCAL code even if the
source code follows standard Pascal.

In what follows, "the compiler" always refers to the Turbo
Pascal compiler even though, from a theoretical point of view,
it could be any Pascal compiler. Similarly, since there is no
standard representation of Arabic data that is available and
implemented, we use the BCON operating system, with ARCII
as the internal representation of data in memory.

A.  MAJOR  CONCEPTS 

The  interface  looks  at  any  piece  of  code  (token)  as  one 
of  several  types.  These  types  are: 

-  Literal  string 

-  Comment 

-  Integer 

-  Identifier 

-  Functional  operator. 

Literal strings are constants and the interface does not
alter their ASCII values. Comments are surrounded by '(*'
and '*)' in Arabic equivalent codes. Integers are important
and easy to handle since there is an isomorphic relationship
between Arabic integer tokens and Latin ones. A real number
token is made up of two integer tokens separated by a
functional operator. An identifier is any legal name in Pascal,
either a reserved word or user-defined. Functional
operators are all the codes that are used for addition,
brackets, pointer arrows, etc. In setting the specification
for programming in Arabic Pascal, the optimum goal is to
have a one-to-one relationship between the Latin and the
Arabic special characters. Also we want to avoid
overloading the use of special characters.

1. Literal Strings

Literal strings are used for assigning into string
variables and for read and write commands. Strings are used
to interact with the user in an application and understand
the performance of the program. Therefore we do not alter
these strings. The literal string is any string of
characters surrounded by single or double quotes. It is the
programmer's responsibility to verify the content of an
assigned string. The literal string can have any character
of the entire set 80 hex ... FF hex.

2. Comments

The comment length is limited to one line. The
comment is enclosed by an opening bracket followed by an
asterisk, and ends with an asterisk followed by a closing
bracket. When the translator encounters the beginning of a
comment it looks for the end of the comment. The comment is
considered as one token. The translator will not alter the
content of the comment since it is for the use of the
programmer only.

3. Integers

Integers are any consecutive digits from 0-9 with no
separation in between. For example, the integer printing
format "2245:6" is considered as three tokens as far as the
translator is concerned. The first token is the integer
"2245," the second is the functional operator ":", and the
third is the integer "6."

Real numbers are made up of three parts as one would
expect. They are integer token, Arabic numeric comma, and
integer token.
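
The integer rule above can be illustrated with a short scanner. The Python sketch below is not the thesis translator, which works on Arabic codes. It uses Latin digits for readability, relying on the isomorphism between Arabic and Latin integer tokens noted earlier, and the function name is hypothetical.

```python
DIGITS = "0123456789"


def split_numeric_tokens(text):
    """Split text into integer tokens and one-character operator tokens.

    A run of consecutive digits forms one integer token; every other
    character is treated here as a single functional operator.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in DIGITS:
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            tokens.append(("integer", text[i:j]))
            i = j
        else:
            tokens.append(("operator", text[i]))
            i += 1
    return tokens


if __name__ == "__main__":
    print(split_numeric_tokens("2245:6"))
```

A real number is scanned the same way: the two digit runs become integer tokens and the separator between them becomes the operator token.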

4. Identifiers

All legal Pascal names fall under this category. This
includes reserved words and variable names. The token is
identified first as an identifier, then looked up in the
reserved words group. If it is not in the list then it is a
variable name. Variable names include variables, labels,
procedure and function names. When an identifier is
encountered and it is not a reserved word, then it is given
an identifier number. The identifier number is stored with
other information about the token in a hashing scheme in a
symbol table. The token is looked up in the symbol table. If
it is not yet entered, it will be entered at the beginning of
the linked list of the same hash key. Since the primary user
of the translated code is the compiler, the program will have
meaningless variable names. However, the translator will
generate a file called "DICTIONARY" containing each
identifier number and the Arabic token associated with it.
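
The lookup-or-insert behavior described above can be sketched as follows. This is only an illustration in Python, not the thesis's Turbo PASCAL code: the names SymbolTable and lookup_or_insert, the table size, and the hashing formula are all assumptions made for the example.

```python
class WordRecord:
    """One symbol-table entry; 'next' links records sharing a hash key."""

    def __init__(self, word, id_no, next_record):
        self.word = word
        self.id_no = id_no
        self.next = next_record


class SymbolTable:
    def __init__(self, size=64):           # table size is an assumed value
        self.buckets = [None] * size
        self.next_id = 1
        self.dictionary = []                # (identifier number, source token)

    def _key(self, word):
        # Hypothetical hashing formula: sum of character codes modulo size.
        return sum(ord(ch) for ch in word) % len(self.buckets)

    def lookup_or_insert(self, word):
        key = self._key(word)
        record = self.buckets[key]
        while record is not None:           # walk the chain for this key
            if record.word == word:
                return record.id_no
            record = record.next
        # Not found: enter the new record at the beginning of the chain.
        record = WordRecord(word, self.next_id, self.buckets[key])
        self.buckets[key] = record
        self.dictionary.append((record.id_no, word))
        self.next_id += 1
        return record.id_no
```

The dictionary list plays the role of the "DICTIONARY" file: one entry per identifier, recorded the first time the identifier is seen.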

5. Functional Operator

Tokens are identified by separators and terminators. Blanks
are separators, as are other codes that have a function
beyond being separators. For example, the plus and minus
signs as well as the up_arrow symbol in PASCAL are
separators. If, for example, the variable root^.left_sun
were the Arabic token, it would be translated into something
like id_1^.id_2, where the identifier numbers are entered
for the Arabic tokens.

The scope of the variables will distinguish frequently
occurring variable names. If id_1 occurred in two
declarations, the compiler will distinguish between the two
occurrences of id_1, depending on the location of the
declarations. Therefore the translator does not need to
concern itself with multiple uses of the same name.


B.  OPERATING  PRINCIPLES 

The translator goes through several phases and each phase
has a sub-task. The process begins with the name of the
Arabic source code file. The file is opened, the target
output file is initialized and a dictionary table file is
opened. The second phase fills a buffer with a code segment
of the source code, a line at a time. The line is broken
into tokens. Each token is given a type and then translated.
The cycle is repeated for each line up to the end of the
source file.

1. File Opening and Initializing Phase

The program starts with a prompt for the user to input the
source file name. The file name is checked for existence and
the file is then reset for reading. The file name is used to
open two more files, the dictionary file and the output
file. The initialization is concerned with the hash table
that holds information regarding the record structure of the
identifiers' symbol table. The rest of the parameters are
optional features, such as listing the source comments with
the output code. Another feature is debugging, for tracing
the translation while the translator is scanning and
translating the source code. Both the comment and debugging
features should be easily set at any point in the source
code. The remaining parameters, for example the line number
and identifier number, are initialized.

2. Reading and Decomposing the Source Code

An input buffer is filled from the source code and scanned.
A line at a time is read from the buffer and checked for
special instructions (directives) for the translator. If the
line is not a directive, it is checked to see if it is a
comment. If the line is a comment or begins with one, the
comment is omitted or written out depending on the comment
option. The comment option is a Boolean variable set by the
user within the program source code, to either omit or write
out the comment tokens in the generated file. The line, or
the remainder of the line, is then decomposed into tokens.
Tokens are identifiers, integers, blanks, or special
characters. Identifiers are either reserved words or
user-defined identifiers. Reserved words are matched with
their associated Latin reserved word. A user-defined
identifier is given a label number in the sequence of its
first appearance, if it does not already have one. Integer
tokens are scanned and each digit is mapped into its
matching Latin digit. Special characters are given their
equivalent Latin characters, such as the Arabic and Latin
semicolon. Blanks are copied, as this makes for better
formatting of the generated code.

The  investigation  of  the  token  type  is  based  on  the 
first  character  of  the  next  token  in  the  input  buffer.  For 
example,  if  the  first  character  is  a: 

- Letter: Then investigate the possibility that it is an
identifier.

- Digit: The token must be an integer.

- Other: Then it must be a special character.
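
A rough sketch of this first-character test, written in Python for illustration only. It assumes Unicode text in place of the single-byte Arabic codes used under the bilingual operating system, and the function names are hypothetical. Blanks, comments, and literal strings are deliberately left out, so a blank here is simply passed through as a one-character special token.

```python
def classify_first_char(ch):
    """Guess the token type from the first character alone."""
    if ch.isalpha():
        return "identifier"
    if ch.isdigit():
        return "integer"
    return "special"


def split_tokens(line):
    """Break a line into (type, text) pairs using the first-character rule."""
    tokens = []
    i = 0
    while i < len(line):
        kind = classify_first_char(line[i])
        j = i + 1
        if kind == "identifier":
            # Letters, digits, and underscores may follow the first letter.
            while j < len(line) and (line[j].isalnum() or line[j] == "_"):
                j += 1
        elif kind == "integer":
            while j < len(line) and line[j].isdigit():
                j += 1
        # A special character is always a one-character token.
        tokens.append((kind, line[i:j]))
        i = j
    return tokens
```

Applied to an integer printing format such as 2245:6 written with Arabic digits, this yields an integer token, a one-character special token for the colon, and a second integer token.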

In this phase only the identifiers are translated. When a
user-defined identifier is encountered, and if it has not
previously been recognized, it is given the next identifier
number in sequence. Reserved word tokens are stored in a
constant table, in a record format. Each record has an
Arabic word and the matching Latin one. Any identifier token
is first looked up in the table. If found, the index of the
matched record is passed back to the main program. The
integer tokens are given the type integer and passed back to
the main program. If none of the above is true, then we get
one character and pass it back individually.

In short, each token is given a token type and length, and
is passed back to the main program. Reserved words are
additionally passed back with the match index. Identifiers
are inserted in the symbol table if they are not found
there, and their identifier number (in Latin characters) is
passed back.

3. Token Translation Phase

The tokens are translated into Latin based on the token
type. The integer tokens are translated by mapping each
Arabic (Eastern Hindu) digit into its associated Latin
(Western Hindu) digit. Reserved word tokens are translated
by writing their matched Latin reserved word, using the
match index found earlier. User-defined identifiers are
replaced by the identifier numbers assigned to them. The
rest of the special characters are looked up in a "CASE OF"
(a PASCAL control statement) list or assigned into a
constant table (array). This model uses a case statement. As
each user identifier is translated and written out in the
output file, it is also written out in the dictionary table
along with the Arabic token associated with it.
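
The per-type translation can be sketched as below. This is an assumption-laden Python illustration, not the thesis's code: Unicode Arabic-Indic digits and punctuation stand in for the single-byte codes of the bilingual operating system, a lookup table replaces the PASCAL case statement, and translate_token and its parameters are hypothetical names.

```python
# Arabic-Indic digits U+0660..U+0669 stand in for the Arabic digit codes
# (an assumption; the thesis works with single-byte codes).
EASTERN_DIGITS = "".join(chr(0x0660 + d) for d in range(10))
DIGIT_MAP = str.maketrans(EASTERN_DIGITS, "0123456789")

# Arabic semicolon and comma mapped to Latin ones; a real table is larger.
SPECIAL_MAP = {"\u061b": ";", "\u060c": ","}


def translate_token(kind, text, reserved_words, latin_ids):
    """Return the Latin text for one token, chosen by its token type."""
    if kind == "integer":
        return text.translate(DIGIT_MAP)
    if kind == "reserved_word":
        return reserved_words[text]
    if kind == "identifier":
        return latin_ids[text]
    # Everything else is treated as a special character; characters with
    # no entry in the table are passed through unchanged.
    return SPECIAL_MAP.get(text, text)
```

Here reserved_words maps each Arabic reserved word to its Latin match and latin_ids maps each user-defined identifier to the Latin name assigned to it; both are supplied by the caller.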

4. File Closing and Ending

The last phase is to close the source file, the dictionary,
and the generated output file. This phase will only be
reached on normal program execution. The program will
terminate if there is a character code not in the range of
the Arabic alphabet defined by the bilingual operating
system. Long tokens and long comments will cause errors and
should stop the translation, since translating a comment
makes no sense.

C.  DESIGN  GOALS 

The interface is supposed to generate, from any Arabic
source code, a Latin code in PASCAL syntax. The Arabic
programmer must master PASCAL programming in his native
language. Essentially little syntax checking and no semantic
checking will be performed on the source code. The
compiler's job is to scan the translated code and perform
the syntax and semantic checks on it. Some help for tracing
and debugging should be incorporated into such an interface.
The compiler gives the error messages in Latin. This could
be utilized in several ways. One way is to keep the line
numbers of the source code and the generated code as close
as possible. The error messages usually are stored in a text
file and can be translated. This, along with the line number
of the error location, can be combined to give the location
and type of the source code error.

A second way, if the error messages cannot be translated in
their file, is to translate the error messages and return
them with the error number. The Arabic programmer can look
up the error number in Latin and the line number of the
error, then look up the translation of the error and its
explanation. In both ways a few hints regarding the errors
and possible causes should be provided to the user.

D.  DESIGN  LIMITATIONS 

The design does not use or handle diacritics at all as far
as reserved words are concerned. Diacritics could cause
errors and personal interpretations of how a reserved word
is written. Since most reserved words are clear once read,
the user must not type any vowels with the reserved words in
the program. Similarly, to avoid duplicating the translation
of a single user-defined identifier, and to eliminate the
complication of debugging such cases, the user should not
use vowels in his defined identifiers. The diacritics may be
used in literal strings and headings of reports. Several
factors may affect and prevent the use of diacritics. First,
some sorting routines sort independently of diacritics,
since vowelization can upset the sorting order and the rules
for sorting the same name with different vowelization. A
second reason is that the location of the vowelization of
the character is not standardized. A third reason is that
the resolution of terminals is poor, and it is hard on the
eye to distinguish, for example, between the "FAT'HA" and
the single quote symbol in printed or displayed form.

The design therefore will not handle vowels in the Arabic
source code. However, it should be noted that the option of
including the diacritics requires few changes in the design,
and a lot of attention from the Arabic programmer. The
attention is required to rewrite his own sorting routine
that sets the ARCII value for the vowelized source code.
Also, the programmer must be consistent in his use of vowels
with identifiers for the above reasons.

The display and print justifications cannot be controlled
easily within the program since the bilingual operating
system does not use a standard unified code for Arabic
display and print mode. For example, in BCON, the operating
system used for the implementation of this thesis, if you
are editing in Arabic screen mode then the cursor in the
entire code will start at the far right of the screen. This
right justification is for the Arabic format and indentation
in Arabic texts. Therefore, if you exit the editor you must
set the screen mode to Latin screen mode, otherwise the
"C:>" prompt will be displayed at the far right of the
screen. So for the sake of simplicity to the user and
consistency on behalf of the generated codes, the display
codes are left out of the translator's control and are under
the control of the display system of the bilingual operating
system. The modes can be set with an external escape code to
the printer or a sequence of key strokes to set the screen
to Arabic mode.

These limitations can be resolved once there is a standard
set. I believe the bilingual operating system should by
default handle the justification issue, and allow the user
to turn this option off. I expect this to come within two to
five years in the industry involved with Arabic text
handling.


VI.  PROGRAM  MODEL 


A.  INTRODUCTION 

The  Lexical  Translator  program  is  intended  to  be  simple, 
flexible,  and  to  demonstrate  feasibility  of  the  concept. 
Speed and efficiency were not primary goals. Features can
be  added  as  needed  based  on  the  response  of  users  of  the 
program. 

The  program  will  require  the  supervision  of  a  good 
PASCAL  programmer  to  assist  the  compilation  and  execution  of 
the  translated  code.  The  assistance  could  be  achieved  by 
simple  detailed  instructions  on  how  to  use  the  program  to 
generate  output  code. 


B.  PROGRAM  ENVIRONMENT 

The  Translator  is  developed  under  a  certain  environment, 
and  until  there  is  a  unified  standard  for  a  bilingual 
operating  system,  program  portability  and  compatibility  will 
be  limited. 

1. Hardware Environment

The program is developed using an IBM XT personal computer.
It can just as well be developed using an IBM PC Jr. or IBM
AT. The IBM XT has 640 kilobytes of RAM memory, a 20
megabyte hard disk, two half-height floppy disk drives, and
the ALIS Inc. graphics board. The board is made up of two
boards back to back. The first board is a Paradise color
graphics board. The second board is on top of the Paradise
board and it has the Arabic character generator and the
necessary connection circuitry. The two boards fit in one
slot on the mother board of the XT computer.

The  keyboard  is  an  IBM  PC  keyboard  with  cap  stickers 
for  the  keys.  Each  sticker  has  two  to  four  different 
characters,  for  Arabic  and  Latin.  The  keyboard  layout  is 
displayed  in  Appendix  D. 

An  Epson  FX  85  dot  matrix  printer  is  used  for  the 
listing  of  the  program.  The  printer  has  an  Arabic  driver  to 
display  Arabic  characters. 

2. Software Environment

The ALIS Inc. BCON bilingual operating system was used in
developing the thesis program and test runs. BCON resides in
low memory using about 20K bytes. BCON is supposed to be
transparent to the DOS operating system. DOS stands for Disk
Operating System, the system used by IBM microcomputers. The
BCON operating system requires special skill and more than
average user knowledge. BCON is mainly required for
generating the Arabic fonts, and interpreting and mapping
the key strokes to their associated ARCII values. The
interpretation and mapping are performed under the Arabic
mode only. The Arabic characters are stored as hex values
ranging from 80 hex up to FF hex. This range of values is
reserved for graphics under the DOS operating system. This
means any Arabic character code is considered a graphic
character in the absence of BCON.

An important concept must be pointed out. The role of BCON
is to display the right form, font, and indentation of
Arabic text. So with minimum skill, a programmer can
develop, review, and correct Arabic characters on any DOS
compatible machine. Then the result can be displayed under
BCON, where BCON can interpret the graphics character as
ARCII code, and display the correct textual form of the
ARCII code by sending the appropriate display code to the
terminal or the printer.

When writing long Arabic texts, it is much easier to do so
under BCON, with the aid of an Arabic word processor. The
simple EDLIN editor available on the DOS distribution disk,
or the Turbo PASCAL editor of version 2.1 and below, will
also work. There are some limitations on what one can use
under BCON and still display Arabic characters. BCON
requires two conditions for compatibility when using any
application. First, the application must call BIOS
interrupts 16 Hex and 10 Hex to access the keyboard and the
screen respectively (information about the interrupts can be
found in DOS technical manuals for personal computers).
Second, the application must handle 8-bit characters.
[Ref. 2: p. 3-1]

Turbo PASCAL version 2.1 was used to write the main program
and resource file. The printer interface, called MPD by ALIS
[Ref. 2], is implemented for several printers. The name
stands for Multi Printer Driver. The MPD was used to drive
the Epson FX 85 to display the Arabic characters in the
program listings and sample tests (Appendices H, I).

C.  PROGRAM  BODY 

The Lexical translator is designed to be easily modified,
and it should be modified when the updated version of BCON
utilizing the unified standard code set is available. The
program is modular and could be rewritten in "C" or FORTRAN.
The program is designed to generate a correct output file
from a correct input source file. The program will not
interpret the result, and the programmer must exercise
creativity and care as his/her programming advances, to
assure correct results and clear output.

The  printable  output  of  any  developed  program  is  either 
a  string  of  characters,  or  mathematical  results.  Since  any 
string  assignment  is  not  altered,  this  will  result  in  no 
difficulties  for  string  output.  If  the  result  is  a  real  or 
integer  number,  the  result  will  be  displayed  based  on  the 
BCON digit mode. The program does not concern itself with
numerals since all the users are familiar with the Western
Hindu Numerals (Latin). Also, BCON has an option that allows
the user to swap the digits in the operating system
environment. So under BCON, analyzing the results of numeric
calculations would duplicate the same work. This may be a
limitation under an operating system other than BCON.


1. 


The program has two main files that are used for the
generation of the output code: the main file and the
resource file. The main file contains constant declarations,
data structure declarations, variable declarations,
procedures and functions, and the main program body.

The resource file has the assignments of a constant array
declared in the main program and is used as an include file.
The resource file has a subset of the reserved words and
standard function names. The resource file is a very useful
modular concept since you can replace the PASCAL resource
file with one for the language "C". With minimum changes in
the constants and directives one could use one Translator
with several resource files, one for each language, to
lexically translate from Arabic to the syntax of one of many
Latin compilers. This program focuses on the Turbo PASCAL
syntax.

2. Generated Files

The translator will generate two files:

- A Dictionary file with the same name and "DIC" extension.

- An Output file with the same file name and "PAS"
extension.

The program will generate the desired output in the "PAS"
file. The dictionary file will be updated each time an
identifier is encountered for the first time. User-defined
Arabic identifiers are translated to identifiers of the form
"id_000 ... id_999."

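The naming rules for the generated files and identifiers can be illustrated with a short Python sketch. The function names and the ".ARB" source extension used in the example are hypothetical; only the "DIC" and "PAS" extensions and the id_000 ... id_999 form come from the text.

```python
import os


def output_names(source_name):
    """Derive the dictionary and output file names from the source name."""
    base, _ext = os.path.splitext(source_name)
    return base + ".DIC", base + ".PAS"


def latin_identifier(id_no):
    """Format an identifier number in the id_000 ... id_999 form."""
    if not 0 <= id_no <= 999:
        raise ValueError("identifier number outside id_000 ... id_999")
    return "id_%03d" % id_no
```

For a source file named SORT.ARB this gives SORT.DIC and SORT.PAS, and identifier number 7 becomes id_007.
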
3. Key Variables and Data Structure Declarations

The external file "Resource.Pas" is an assignment of a
constant array. Each element of the array is a record. The
record has two components. The first component is the Latin
reserved word or function name, and the second component is
the Arabic translated (matching) word (see footnote 3).

The user-defined identifiers are handled by a hashing scheme
and a symbol table. The decision was to demonstrate an
efficient way to store and retrieve identifiers. The lexical
translator will be constantly looking up any non-reserved
identifier in a symbol table to insert it or to get its
Latin match if predefined. To improve efficiency, the
program uses a direct chain Hashing scheme [Ref. 3: p. 45].

The identifier is passed to a function and given a key
number by Function KEY. With a hashing formula the function
calculates the key number of the identifier. The key number
is a location in the Hash table. The content of this
specific location is a pointer to a word_record which either
contains the word or is where a new record should be
inserted in case the word was not found. Words having the
same key number will be linked together in a linked list.


3. The translation is in no way a standard or professional
translation. It was made for demonstration purposes.


The incidence of several words having the same key number
decreases the efficiency of Hashing (see Ref. 3 on how to
avoid Hashing collisions and when to use Hashing). The word
record has the following fields:

Id_No    - the identifier number in the sequence of
           insertion.

Length   - number of characters of the identifier.

Lastchar - location of the last character in the symbol
           table.

Nextword - pointer to the next identifier with the same key
           number.

Latin_Id - the Latin identifier assigned to the identifier.

With the above word (identifier) information, we can locate
the word in the symbol table. The spelling table is declared
as an array of 5000 characters. The size is an estimate and
can be changed once a closer estimate is available. The
symbol table is implemented as a linked list and its size
can vary dynamically so as to be as large as necessary.
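
One way to read the Length and Lastchar fields is that they locate an identifier's spelling inside the 5000-character spelling table. The Python sketch below illustrates that reading only; the class and method names are hypothetical, and treating Lastchar as a zero-based index into the character array is an assumption.

```python
SPELLING_SIZE = 5000                        # size quoted in the text


class SpellingTable:
    def __init__(self):
        self.chars = [" "] * SPELLING_SIZE
        self.used = 0                       # number of occupied positions

    def store(self, word):
        """Append a spelling; return (lastchar, length) for the word record."""
        if self.used + len(word) > SPELLING_SIZE:
            raise OverflowError("spelling table full")
        for ch in word:
            self.chars[self.used] = ch
            self.used += 1
        return self.used - 1, len(word)

    def fetch(self, lastchar, length):
        """Recover a spelling from its last-character position and length."""
        start = lastchar - length + 1
        return "".join(self.chars[start:lastchar + 1])
```

Under this reading each word record stays a fixed size, while spellings of any length share one character array.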

The  translator  looks  for  tokens  using  two  methods. 
The  first  method  uses  a  pair  of  delimiters  to  identify  the 
token.  The  pair  define  the  beginning  and  end  of  a  token. 
Token  classes  that  can  be  identified  by  this  method  are 
comments,  literal  strings,  and  directives. 

The second method recognizes a token by its first character.
Examples of this class are integers and identifiers. The
second method includes tokens with one character such as
separators and terminators. Both separators and terminators
will be referred to as delimiters throughout the program.
The delimiters are defined in a constant set. The Hex values
of the set can be interpreted with the aid of the ARCII
table (Appendix D).

Errors are a user-defined data type. Types of errors are,
for example, long_token, long_comment, and
long_literal_string. All of the above errors are expected to
occur as a result of failure of the programmer to end a
comment or literal string.

The  token  types  are  defined  to  be  one  of  the 
following: 

-  Blanks 

-  Reserved_word 

-  Identifier 

-  Literal_String 

-  Control_Code 

-  Comment 

-  Integer 

-  Functional_Operator 

-  Unclassified 

-  Illegal 

These  are  the  main  declarations  of  the  program.  The 
definition  of  the  tokens  and  assignments  of  the  variables 
will  be  covered  in  the  following  sections. 
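
As an illustration only, the token types and the error type could be declared as enumerations. The sketch below is in Python rather than Turbo PASCAL; the Token class and its fields are hypothetical, and the error names follow the examples given in the text.

```python
from enum import Enum, auto


class TokenType(Enum):
    BLANKS = auto()
    RESERVED_WORD = auto()
    IDENTIFIER = auto()
    LITERAL_STRING = auto()
    CONTROL_CODE = auto()
    COMMENT = auto()
    INTEGER = auto()
    FUNCTIONAL_OPERATOR = auto()
    UNCLASSIFIED = auto()
    ILLEGAL = auto()


class Error(Enum):
    LONG_TOKEN = auto()
    LONG_COMMENT = auto()
    LONG_LITERAL_STRING = auto()


class Token:
    """A scanned token; its type starts out as UNCLASSIFIED."""

    def __init__(self, text=""):
        self.text = text
        self.type = TokenType.UNCLASSIFIED
        self.errors = set()                 # set of Error members
```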




4. Token Classes I and II

Class I tokens are recognized using the first method. This
includes the following types of tokens:

Literal_String: This token begins with an Arabic quote mark,
single or double, and ends with it. The Hex values are 97
Hex and A2 Hex.

Comments: Begin with a right bracket followed by an asterisk
and end with an asterisk followed by a left bracket.

Directives: Are strings in curly brackets. This feature is
for debugging. The directives will allow the user to choose
between commented Latin source, with original comments, and
a debugging option to display on the monitor the tokens and
their types.
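
Recognition by a pair of delimiters can be sketched as a single scanning routine. This Python sketch is an illustration under assumptions: the Latin delimiters in the examples stand in for the Arabic codes, and scan_delimited and its parameters are hypothetical names.

```python
def scan_delimited(line, start, opener, closer, error_name="long_token"):
    """Scan a token that opens at line[start] and must close on this line.

    Returns (token_text, next_position, error), where error is None
    on success.
    """
    if not line.startswith(opener, start):
        raise ValueError("no opening delimiter at this position")
    end = line.find(closer, start + len(opener))
    if end == -1:
        # The closing delimiter was not reached before the end of the line.
        return line[start:], len(line), error_name
    stop = end + len(closer)
    return line[start:stop], stop, None
```

The same routine serves comments, literal strings, and directives by changing the delimiter pair, for example scan_delimited(line, 0, "(*", "*)", "long_comment") for a comment.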

Class  II  covers  the  identifiers,  including  reserved 
words,  and  integers.  The  remainder  of  token  types  will  be 
reviewed  shortly. 

Identifiers and Reserved Words: Begin with an Arabic letter
followed by an optional number of underscore, digit, or
other Arabic characters.

Integers: Begin with a digit and end with any non-digit
character.

The remainder of the token types are Functional_Operator,
Illegal, and Unclassified. Functional_Operator tokens are
the arithmetic operators, brackets, asterisk, decimal digit,
semicolon, colon, pointer '^', etc. The Illegal token is the
token that exceeds its defined length. This condition is
used to set an error message to pass to the user about the
location of an error. An Illegal token is also set if the
Hex code is less than 80 Hex. The legal range is 80 ... FF
Hex. The control code is any escape code or function call
within the range of Arabic characters, 80 Hex ... FF Hex.
The Unclassified token type is used as the initial value
before the type is determined.

D.  PROGRAM  MODULES 

The  Lexical  translator  will  call  several  procedures  and 
functions  in  the  process  to  generate  the  desired  code.  The 
main  body  of  the  program  calls  several  procedures  and 
functions.  The  program  modules  and  their  locally  declared 
procedures  and  functions  are  as  follows: 

Open_File
Initialize
Fill_Buffer
Token_and_Type
Blank
Comment
Literal_String
Integer_Token
Identifier_Token
Reserved_Token
Special_Char_Token
Control_Char_Token
Map_Identifier_To_Latin
Search
Hash_Key
Insert: calls Id_No
Found
Latin_Integer
Get_Latin_Spec_Char
Print_Error_Messages

1. Open_File

The program starts by calling the Open_File procedure. The
procedure will prompt the user for the name of a file to
translate and verify that the file does exist. The second
part is to open the input file for reading, and to open the
Output file and the Dictionary file for writing.

2. Initialize

The Initialize procedure will set all the hash table
pointers to nil. The nil values are used to indicate that
there are no words with that key number yet initialized. It
will also set the initial values of global variables. The
module is called once at the beginning of the program.

3. Fill_Buffer

This procedure will get a line of source code, keep track of
the line number of the source code, and set the line size of
the source code. This module is continuously called by the
main program until the end of the source file is reached.

4. Buffer_Empty

This function will test to see if the variable Next_Loc,
which represents the next token location on the line, is
pointing beyond the Line_Size variable. This case will set
the function to true, causing the main program to call the
Fill_Buffer procedure to refill the buffer. This module is
called continuously by the main program.
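
The interplay of Fill_Buffer and Buffer_Empty can be sketched as follows. This Python illustration makes several assumptions: positions are zero-based (so "beyond the line size" becomes next_loc >= line_size), the LineBuffer class and consume_all driver are hypothetical, and the driver steps one character at a time where the real program would fetch a whole token.

```python
class LineBuffer:
    def __init__(self, lines):
        self._lines = iter(lines)
        self.line = ""
        self.line_no = 0
        self.line_size = 0
        self.next_loc = 0

    def fill_buffer(self):
        """Fetch the next source line; return False at end of source."""
        try:
            self.line = next(self._lines).rstrip("\n")
        except StopIteration:
            return False
        self.line_no += 1
        self.line_size = len(self.line)
        self.next_loc = 0
        return True

    def buffer_empty(self):
        # True once the next location points past the last character.
        return self.next_loc >= self.line_size


def consume_all(lines):
    """Visit every character of every line, refilling when the buffer empties."""
    buf = LineBuffer(lines)
    seen = []
    while buf.fill_buffer():
        while not buf.buffer_empty():
            seen.append((buf.line_no, buf.line[buf.next_loc]))
            buf.next_loc += 1
    return seen
```

Keeping the line number in the buffer is what allows an error message to point back at the source line.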

5. Token_And_Type

When  called,  this  procedure  is  passed  a  line  of 
source  code  and  the  location  of  the  first  character  of  the 
token  to  be  fetched.  The  procedure  gets  the  token  and  gives 
it  a  type.  The  procedure  initially  sets  the  type  of  the 
token  to  Unclassified  and  through  several  calls,  tries  to 
analyze  the  type  of  the  token.  The  first  convenient  check 
is  for  Comments.  It  should  be  noted  here  that  one  would 
like  to  place  the  most  likely  type  check  at  the  beginning  to 
reduce  time  of  analysis  of  the  token  type.  Another  reason 
for  searching  for  comments  first  is  because  they  are  the 
only  type  that  requires  two  characters  in  the  beginning  and 
the  end  of  the  token.  The  rest  can  be  predicted  just  by 
inspecting  the  first  character. 

If the token type is not set to Comment, then the module
calls several modules with a case statement. The modules are
called based on the first character after the last token
read. The Next_Loc variable points at this character in the
input line buffer called "Line." The possibilities are:

FIRST CHARACTER             LIKELY TOKEN TYPE

Arabic space                Blank(s)
Double or Single Quotes     Literal string
Arabic Digit                Integer
Arabic Letter               Identifier
Function Code               Control Char
Other Characters            Special Characters

Each possible token type above represents a module. The
module will be called to set the type of the token.

Looking at each module called by Token_and_Type, they all
set the token type and the length of the token. All likely
token types except for Literal Strings and Comment will not
set any error flags, since one character will satisfy their
requirements. For example, Blanks, Integers, Identifiers,
Control Characters, and Special Characters all could be one
character long. When the Literal String and Comment modules
are called, the token must begin and end with a
predetermined pattern. So a comment left open for longer
than the line length is an error, and the same holds for a
long literal string token. Token_And_Type only examines the
Line Buffer character and does not consume it. The called
modules assign the character to the Token Buffer and advance
the pointer of the Line Buffer one character. When a token
type is successfully assigned, the module sets the token
length. PASCAL uses the first array location to store the
length of the assigned characters in bytes.
The behavior of the modules called by Token_and_Type is
summarized below:

Blanks: Will keep consuming the Line Buffer blanks (Arabic
and Latin) until a non-blank character is reached. Blanks
will set Token Type and Length.

Comment: Consumes the characters within the Arabic
characters range, until the comment closing mark is reached.
The module will set the error set to long_comment if any
character lies in the Latin alphabet range, including the
end of file and carriage return (ASCII 0D, 0A Hex). The
error is long_comment since the comment is restricted to one
line. Comment alters the opening and closing brackets of the
Arabic comment token. The characters are the Arabic opening
bracket, closing bracket, and the asterisk, having the Hex
values A8, A9, and AA respectively.

Literal_String: The module will be called in case the next
character is a single or double quote. The module will
expect the token to be terminated with the same character it
began with. If the matching character is not reached before
the end of the line it is considered an illegal token, and
the error set will be assigned the type long_token. Valid
literal strings will not be altered. However, the opening
and closing quotes will be translated to single or double
quotes accordingly.

Integer_Tok:  Stands  for  integer  token,  and  will  be 

called  when  a  digit  is  present.  The  module 
will  keep  assigning  the  Latin  digits  in  the 

token  buffer,  and  assign  the  Token _ Type 

Integer  to  the  variable  Tok_type. 

Identif ier_Tok:  Will  be  called  when  the  character  is  a 

letter.  The  single  letter  qualifies  as  an 
identifier  alone,  or  could  be  followed  by  an 
optional  number  of  Arabic  underscore,  digit, 
or  letter.  The  module  will  set  the  Tok_type 
to  Identifier.  The  module  has  no  effects  on 
error  set,  since  when  called  it  was  a  valid 
token  based  on  the  first  character  of  the 
token. 

Reserved_Tok:  The  module  is  called  when  the  token  found 

is  an  identifier.  The  module  will  check  if 


the  token  is  in  the  reserved  words  constant 
array  called  "Res_Word."  If  the  identifier  is 
a  reserved  word  the  index  of  the  table  is 
passed  back  to  the  main  program. 

Control_Char_Tok:  The  module  is  called  when  a  BCON 

function  code  is  the  next  character  in  the 
Line_Buffer.  The  module  assigns  one  character 
(code)  to  the  token  buffer. 

Special_Char :  This  module  assigns  one  character  to  the 

token.  The  token  will  always  have  one 

character. 
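The dispatch on the first character described above might be sketched as follows. This is an illustration in Python, not the thesis code: the function name and return shape are hypothetical, the digit and letter ranges are read off the ARCH table in Appendix D, and the quote and underline codes are assumptions. Comment and control-character handling are left out, and the sketch returns raw bytes rather than translated ones.

```python
# Sketch only: the quote and underline codes below are assumptions.
ARABIC_SPACE, LATIN_SPACE = 0xA0, 0x20
ARABIC_DIGITS = range(0xB0, 0xBA)      # Arabic digits 0..9
ARABIC_LETTERS = range(0xD0, 0xFE)     # HAMZAH .. HAMZAH ON YA A
QUOTES = {0x84, 0xA2}                  # assumed: Arabic apostrophe, Arabic quotation mark
UNDERLINE = 0x96                       # assumed: Arabic underline


def token_and_type(line: bytes, pos: int):
    """Return (token_type, token_bytes, next_pos, errors) for the token at pos."""
    first = line[pos]                  # examined here, consumed by the branch below
    errors = set()
    end = pos + 1                      # every token is at least one character long
    if first in (ARABIC_SPACE, LATIN_SPACE):
        while end < len(line) and line[end] in (ARABIC_SPACE, LATIN_SPACE):
            end += 1
        return "blanks", line[pos:end], end, errors
    if first in QUOTES:
        close = line.find(bytes([first]), end)   # must close with the same quote
        if close == -1:                          # no match before the end of the line
            errors.add("long_token")
            return "literal_string", line[pos:], len(line), errors
        return "literal_string", line[pos:close + 1], close + 1, errors
    if first in ARABIC_DIGITS:
        while end < len(line) and line[end] in ARABIC_DIGITS:
            end += 1
        return "integer", line[pos:end], end, errors
    if first in ARABIC_LETTERS:
        while end < len(line) and (line[end] in ARABIC_LETTERS
                                   or line[end] in ARABIC_DIGITS
                                   or line[end] == UNDERLINE):
            end += 1
        return "identifier", line[pos:end], end, errors
    return "special_char", line[pos:end], end, errors
```

Only the literal-string branch can add to the error set here, which mirrors the observation that one-character token types never fail once their first character has been seen.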

When Token_And_Type returns the token type to the main program, a case statement either calls a procedure or does the processing with a compound statement. Blanks are translated to Latin ASCII blanks. A returned comment token is written out as is. Literal strings are written out literally. Reserved words are written out using the Match_Index into the Res_Word constant array. Identifiers are looked up in the symbol table. If the identifier is already defined, its identifier number is returned; otherwise the identifier is inserted in the table and given an identifier number. The module used is called Map_Iden_To_Latin.


6. Map_Iden_To_Latin

The Identifier token is received and searched for with a procedure called Search.

7. Search


This  module  starts  by  calling  the  Hash_Key  function. 


a.  Hash_Key 

Hash_Key calculates the token key_no with a hash formula. The key number is used to look up the pointer to the word record in the hash table. The word record is a linked list of identifiers that share the same key number. All the pointers are initialized to nil at the beginning of the program. If the key number yields a nil pointer, there is no such word in the symbol table, nor any other word with the same hash key number. Search then calls Insert to insert the identifier in the symbol table.

b.  Insert 

Insert creates a word record at the beginning of the linked list and stores the identifier in the spelling table. Insert makes a call to IDEN_LBL_NO, which uses the global variable ID_NO (sequence of appearance), and assigns an identifier number in the word record.

If  the  pointer  is  pointing  at  a  word  record,  then  the  first 
word  in  the  linked  list  is  checked,  and  so  on  until  there 
are  no  more  word  records  in  the  list  or  the  word  is  found. 

c.  Found 

The  function  Found  checks  if  the  resulting 
pointer  is  pointing  at  the  exact  identifier  spelling. 

If  the  word  record  is  found  then  it  already  has  been 
assigned  a  specific  identifier  number  which  is  then  passed 
back  to  the  main  program  to  be  written  out  as  the  Latin 
identifier. 
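The Search, Insert, and Found steps above can be sketched as a hash table with chained word records. This Python sketch is illustrative only: the table size and the hash formula are assumptions (the thesis formula is not reproduced in this section), and the class and method names are hypothetical. The limit of one thousand identifiers is the one stated under Limitations.

```python
HASH_SIZE = 211            # assumed table size
MAX_IDENTIFIERS = 1000     # limit stated in the Limitations section


class WordRecord:
    def __init__(self, spelling, id_no, next_record):
        self.spelling = spelling
        self.id_no = id_no
        self.next = next_record


class SymbolTable:
    def __init__(self):
        self.buckets = [None] * HASH_SIZE   # all pointers start as nil
        self.id_no = 0                      # sequence of appearance

    def hash_key(self, token: bytes) -> int:
        # Assumed formula: sum of the character codes modulo the table size.
        return sum(token) % HASH_SIZE

    def search(self, token: bytes) -> int:
        key = self.hash_key(token)
        record = self.buckets[key]
        while record is not None:           # walk the chain for this key number
            if record.spelling == token:    # the Found check: exact spelling
                return record.id_no
            record = record.next
        return self.insert(key, token)

    def insert(self, key: int, token: bytes) -> int:
        if self.id_no >= MAX_IDENTIFIERS:
            raise OverflowError("identifier limit reached")
        self.id_no += 1
        # The new word record goes at the beginning of the linked list.
        self.buckets[key] = WordRecord(token, self.id_no, self.buckets[key])
        return self.id_no
```

Two identifiers with the same characters in a different order collide under this assumed formula, which exercises the chain walk: the second lookup of the first identifier still returns its original number.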
8. Latin_Int

The  procedure  maps  each  digit  of  the  token  to  the 
Latin  digit  0...9,  and  passes  back  the  Latin  integer. 
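A minimal sketch of this digit mapping, assuming the Arabic digits occupy the ARCH codes B0 to B9 Hex as listed in Appendix D. It is written in Python for illustration, and the function name is hypothetical.

```python
ARABIC_ZERO = 0xB0   # ARCH code of Arabic digit 0 (Appendix D)


def latin_int(token: bytes) -> str:
    """Map each Arabic digit code of an integer token to the Latin digit 0..9."""
    digits = []
    for code in token:
        if not 0xB0 <= code <= 0xB9:
            raise ValueError(f"not an Arabic digit code: {code:#04x}")
        digits.append(chr(ord("0") + code - ARABIC_ZERO))
    return "".join(digits)
```

For example, the codes B1 B9 B8 B2 map to the Latin integer 1982.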

9. Get_Latin_Spec_Char

The  procedure  is  to  give  each  Arabic  special 
character  its  Latin  "functionally"  equivalent  character. 
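One way to picture this lookup is a small table keyed by the Arabic code. The entries below are an assumption: they pair each ARCH special character with the ASCII character on the same key, as read from the Key Codes to Reduced Codes table in Appendix D, and the thesis's own table may differ. The sketch is in Python, and the names are hypothetical.

```python
# Assumed mapping from ARCH special-character codes to Latin characters.
LATIN_EQUIVALENT = {
    0xA8: "(", 0xA9: ")", 0xAA: "*", 0xAB: "+", 0xAD: "-",
    0xBA: ":", 0xBD: "=", 0xBF: "?",
}


def get_latin_spec_char(code: int) -> str:
    """Return the Latin character that plays the same role as the Arabic one."""
    try:
        return LATIN_EQUIVALENT[code]
    except KeyError:
        raise ValueError(f"no Latin equivalent for ARCH code {code:#04x}") from None
```

Under this mapping the Arabic comment opening, codes A8 and AA, becomes the PASCAL comment opening "(*".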

10. Print_Errors

Based on the error set, Print_Errors reports the error type and the line number in the source code where the error was encountered.

E. PROGRAM DIRECTIVES

The program offers two directives. The first is the option to keep the source comments in the output file; by default the program omits them. The second turns the debug option on or off. It may appear at any location in the code, at the beginning of a line. When on, this option displays the tokens and their types as they are scanned.

The program is demonstrated by a list of test runs that verify the translation of reserved words and special characters. A sample of small PASCAL programs is also included, with their generated files, code, and dictionary tables (Appendix I).

F. LIMITATIONS

The program does not allow the user to use the 'Include' directive in TURBO PASCAL. The size of a program is limited by TURBO PASCAL to 64k, beyond which additional code would normally be brought in as an 'Include' file.

The  program  is  set  to  handle  up  to  one  thousand 
identifiers.  This  is  a  reasonable  number  in  working  with 
TURBO  PASCAL  since  the  program  size  is  limited  to  64k  bytes. 

The spelling table is 5000 characters long. That means the total length of all identifiers cannot exceed 5000 characters. When writing long programs, the programmer can avoid exceeding the limit by using short identifiers.

The program will not generate an error flag if a Latin string is found in a comment or a literal string. This is because neither comments nor literal strings are altered.

ARCH provides two commas. The numeric comma is used with real numbers in Arabic, and the Arabic comma is used, in this specification, as the Latin comma except in the real number case. This is a small hurdle when translating the generated code back to Arabic. The two Arabic commas differ in appearance. They are 180° out of phase on the vertical axis, and the numeric comma looks like the Latin comma. Both commas are used in order to avoid overloading the Arabic comma.


VII.  CONCLUSION 

This  thesis  has  tried  to  narrow  the  gap  between  educated 
Arabic-speaking  people  and  computers  in  general.  The  target 
ages  are  mid-teenage,  and  forty-five  and  above.  The 
majority  of  these  two  classes  still  look  at  computers  as 
magic.  They  believe  man  created  them.  However  they  have  a 
hard  time  believing  that  man  tells  computers  what  to  do. 
With  that  attitude,  the  only  thing  that  can  convince  them  is 
to  help  them  to  write  small  programs  and  see  the  results. 
We  are  convinced  that  the  majority  will  get  rid  of  their 
fear  and  have  the  desire  to  explore  this  machine. 

In short, the interface between the rich Latin software library and the Arabic language environment is a promising area, in the sense that it will bring those who fear computers closer to them and help them find a more efficient way to get the job or hobby done.

A.  CONCEPT  FUTURE 

The program is simple in concept and simple to code, but the environment where it is expected to work is not yet standardized. The standards are not widely implemented, nor are the developers of bilingual operating systems very helpful in responding to concerns about hardware compatibility.

Once a unified environment is established, the concept could be developed further. The goal of this work was to illustrate feasibility and to avoid issues specific to the implementation environment. The program modules were designed to be adaptable and portable for several purposes with little modification. For example:

-  For  several  programming  language  translations,  such  as 
"C,"  FORTRAN,  and  BASIC,  we  only  need  several  resource 
files  and  several  special  character  sets,  one  for  each 
programming  language  requirement. 

-  For  several  code  sets,  including  different  languages,  we 
need  the  concept  of  a  bilingual  operating  system  that 
uses  the  upper  range  of  the  character  set  ranging  from 
80  ...  FF  Hex. 

-  The  program  can  work  in  a  Latin-only  operating  system, 
to  translate  source  codes  that  have  been  edited  using 
Arabic  code  set  values.  Also,  the  generated  source 
could  be  compiled  in  the  same  machine.  If  the  program 
is  interactive,  then  it  needs  to  run  under  a  bilingual 
operating  system. 


B.  LIMITATIONS 

The  bilingual  operating  system  was  not  well  documented 
as  far  as  how  some  of  the  function  codes  are  implemented 
during  editing.  Some  of  the  characters  have  two  codes  (such 
as  the  Arabic  multiply  sign  and  the  numeric  multiply  sign) . 
To  know  which  multiply  sign  is  generated  when  I  strike  a 
key,  I  had  to  use  an  editing  tool  to  display  the  code  in  Hex 
values  and  match  the  text  file  and  its  Hex  values. 

Right indentation is relative to the editor mode. If you select your screen mode to be Arabic and you read a piece of Latin code, it will be right justified.


The  user  must  be  careful  reading  data  files.  Some  data 
is  readable  only  in  Arabic  mode  and  some  data  is  readable 
only  in  Latin.  Also  the  data  displayed  may  have  been 
transformed  by  the  operating  system.  As  mentioned  before, 
the  user  could  use  the  "SWAP"  option  for  altering  ASCII 
digits  and  ARCH  digits  in  the  DOS  environment,  or  read  the 
digits  as  a  string  and  change  the  values  into  ASCII.  This 
is  important  in  order  to  perform  numerical  operations  with 
Arabic  digits. 

I  strongly  believe  that,  with  time,  standards  will  be 
developed  with  more  care  and  concern  for  the  user.  This  is 
the  reason  we  chose  not  to  design  the  program  for  a  specific 
system. 

It  is  hoped  that  this  work  will  benefit  other 
researchers  and  future  thesis  students  from  other  countries 
since  a  similar  concept  could  be  applied  to  other  languages, 
especially  languages  descended  from  Latin. 


APPENDIX  A 


FIGURES 

[Figure: the Arabic letters]

Figure 1. The 28 Arabic Alphabets

[Figure: the Arabic letters of the optimum set]

Figure 2. The 31 Alphabets (Optimum Set)


NAME     CHARACTER    NAME     CHARACTER

ALEF     ا            DAD      ض
BA'A     ب            TAH      ط
TA'A     ت            DHAH     ظ
THA      ث            AIN      ع
JEEM     ج            GHAIN    غ
HA'A     ح            FA       ف
KHA'A    خ            QAF      ق
DAL      د            KAF      ك
THAL     ذ            LAM      ل
RA       ر            MEEM     م
ZA       ز            NOON     ن
SEEN     س            HA       ه
SHEEN    ش            WAW      و
SAD      ص            YA       ي

HAMMAZAH        ء
TA'A MARBUTA    ة
ALEF MAQSURA    ى

Figure 3. Arabic Alphabet Names


[Figure: a row of Eastern Hindu numerals above a row of Western Hindu numerals (9 8 7 6 5 4 3 2 1)]

Figure 5. Hindu Numerals


[Figure: sample Arabic text]

a. Without Vowels

[Figure: sample Arabic text with vowel marks]

b. With Vowels

Figure 6. Arabic Text


APPENDIX B


TEXAS INSTRUMENTS APPROACH TO
BILINGUAL OPERATING SYSTEM


Philosophy  of  Bilingual  Arabic 
Latin  Implementation  on  Microcomputer 

System 


Texas  Instruments 


ARABIC  COMPUTER  SYSTEMS 
PHILOSOPHY 


SPECIFIC  CHARACTERISTICS  OF  THE  ARABIC  LANGUAGE 


★  ARABIC  IS  WRITTEN  FROM  RIGHT  TO  LEFT 

★  THERE  ARE  SOME  VARIATIONS  IN  TYPES  OF  ARABIC  CURRENTLY  IN  USE  IN 
DIFFERENT  COUNTRIES 

★  THE  LANGUAGE  IS  A  FOUR  LEVEL  ONE.  A  CHARACTER  CAN  HAVE  UP  TO 
FOUR  SHAPES  DEPENDING  ON  ITS  POSITION  IN  THE  WORD  :  ISOLATED, 
INITIAL,  MEDIAL  OR  FINAL 

★  ARABIC  CHARACTERS  ARE  JOINED  WITHIN  A  WORD 

★  NO  UPPER  CASE  EXISTS  IN  ARABIC 


I'll start my presentation by briefly mentioning some of the characteristics of the Arabic language which have been covered in previous papers and which affect the use of the Arabic language in the computer field.




★  ONLY  THREE  CHARACTER  VOWELS  EXIST  IN  ARABIC  : 

ALIF ا , OUAOU و , YAA ي

★  VOWELISATION  IN  ARABIC  IS  ALSO  PERFORMED  THROUGH  THE  USE  OF 
DIACRITICS.  THESE  ARE  USED  : 

-  IN  THE  CASE  OF  SIMILARLY  WRITTEN  WORDS  TO  AID  THE  READER 

-  IN  RELIGIOUS  TEXTS  INCLUDING  THE  KORAN 

-  FOR  SCHOOL  TEACHING 

★  ARABIC  LANGUAGE  USES  INDIAN  NUMERICS,  WITH  THE  DECIMAL  POINT 
BEING  A  COMMA. 

★ THERE ARE ARABIC SPECIAL CHARACTERS WHICH INCLUDE THE ARABIC COMMA ، , SEMICOLON ؛ , QUESTION MARK ؟ , ETC.




ARABIC  ALPHABET 


★  THE  BASIC  ARABIC  ALPHABET  IS  COMPOSED  OF  28  CHARACTERS 

★  THE  LAMALIF  WHICH  IS  COMPOSED  OF  TWO  CHARACTERS  LAM  +  ALIF 
IS  CONSIDERED  AS  ONE  CHARACTER 

★  THE  HAMZA  CAN  BE  WRITTEN  IN  MANY  DIFFERENT  WAYS  IN  ARABIC 
DEPENDING  ON  ITS  USE,  WITH  A  VOWEL  OR  ISOLATED 

★  IF  THESE  TWO  CHARACTERS  ARE  TAKEN  INTO  CONSIDERATION  THE 
ALPHABET  IS  30  CHARACTERS 

★  THE  TAMARBOUTA  IS  A  SPECIAL  CHARACTER  NOT  INCLUDED  IN  THE 
ALPHABET.  IT  IS  OCCASIONALLY  INCLUDED  AT  THE  END  OF  WORDS 
DEPENDING  ON  GRAMMATICAL  RULES 


ARABIC  COMPUTER  SYSTEMS 

BILINGUAL  SYSTEM  APPROACH 


SOLUTION 1 : CORRESPONDENCE & DIFFERENCES


★ THIS STUDY IS BASED ON THE CORRESPONDENCE AND DIFFERENCES BETWEEN ARABIC CHARACTERS. THE ARABIC ALPHABET MAY BE CONSIDERED AS FORMED OF THREE TYPES OF CHARACTERS :

- TYPE A INCLUDES CHARACTERS HAVING 1, 2, OR 3 POINTS

- TYPE B INCLUDES CHARACTERS WITHOUT POINTS

- TYPE C INCLUDES CHARACTERS HAVING AT LEAST ONE FORM IN EACH CASE

★  IF  WE  ONLY  CONSIDER  THE  FORMS  WITHOUT  POINTS  WE  CAN  REDUCE  THE 
CHARACTERS  IN  EACH  TYPE  AND  THEN  ADD  THE  POINTS  AFTERWARDS 




SOLUTION  2  :  ROOTS  &  APPENDICES 


★ A STUDY BASED ON THE USE OF APPENDICES AND ROOTS TENDS TO REDUCE THE TOTAL NUMBER OF SHAPES BY CONSIDERING A ROOT TO BE USED IN INITIAL & MEDIAL SHAPES TO WHICH AN APPENDIX IS ADDED TO FORM THE FINAL OR ISOLATED SHAPES


[Figure: examples in three columns, TYPE A, TYPE B, and TYPE C, each showing a final or isolated shape formed as a root plus an appendix]

The problem with this solution is what code to give to these appendices. If they are coded, would they be considered as characters in a character count? How would high level languages interpret them? How would special s/w functions interpret them (replace, insert, find string)?

This  is  the  study  which  resulted  in  the  actual  Arabic  implementation  on  Texas  Instruments  equipment 
and  which  will  be  explained  in  this  paper. 



SOLUTION  3  :  CONTEXTUAL  ANALYSIS 


★ A STUDY BASED ON THE USE OF SHAPING ALGORITHMS. USING CONTEXTUAL ANALYSIS TO DETERMINE THE PROPER SHAPE OF THE CHARACTER, FOUR GROUPS ARE IDENTIFIED

-  GROUP  1  ONE  SHAPE  PER  CHARACTER 

-  GROUP  2  TWO  SHAPES  PER  CHARACTER 

-  GROUP  3  THREE  SHAPES  PER  CHARACTER 

-  GROUP  4  FOUR  SHAPES  PER  CHARACTER 

★  POSSIBLE  APPROACHES 

-  ONE-KEY  ONE-SHAPE  SIMPLIFIES  THE  SOFTWARE  BUT  USUALLY  LIMITS  THE 
SET  OF  ARABIC  CHARACTERS  AND  CREATES  A  COMPLEX  KEYBOARD  SINCE 
ALL  THE  ARABIC  CHARACTER  SHAPES  MUST  BE  PRESENT  ON  THE 
KEYBOARD. 

-  ONE-KEY  MANY-SHAPES  IMPLIES  MORE  SOPHISTICATED  SOFTWARE  BUT 
SIMPLIFIES  KEYBOARD  &  USER  INTERFACE 


Of these 2 approaches the 2nd one has been chosen and this will be covered in the following slides.


APPENDIX C


DS9900  BILINGUAL  COMPUTER 
SYSTEM  BY  TEXAS  INSTRUMENTS 


ARABIC  COMPUTER  SYSTEMS 
DS990  BILINGUAL  SYSTEM 


COMMERCIAL  COMPUTING  REQUIREMENTS  FOR  THE  MIDDLE-EAST 

★  BILINGUAL  LATIN/ARABIC  DATA  INPUT  &  OUTPUT 

★  COBOL  DRIVEN  APPLICATIONS 

★  BILINGUAL  PRINTING 

★  BILINGUAL  SORT/MERGE 


SPECIAL PRODUCTS DEVELOPED TO MEET REQUIREMENTS


★  BILINGUAL  DATA  ENTRY  TERMINAL 

★  BILINGUAL  MATRIX  PRINTER 

★  BILINGUAL  LINE  PRINTER 


SOFTWARE


These handle both in the natural manner + software simplified k/vv for operators + high level languages easy handling.




CHARACTERISTICS  OF  BILINGUAL  DATA  ENTRY  TERMINAL 

★  BILINGUAL  VIDEO  DISPLAY  UNIT 

- THE CHARACTER GENERATOR ROM GENERATES 7x8 MATRIX FOR ALL STANDARD ASCII CHARACTERS AND 128 ARABIC SHAPES

- A 7 x 10 MATRIX IS USED FOR INTRICATE ARABIC CHARACTERS


Latin  &  Arabic  can  be  displayed  on  the  screen  at  the  same  time. 



★  BILINGUAL  KEYBOARD 

-  PROVIDES  5  MODES  OF  OPERATION  :  ARABIC,  LATIN,  SHIFT,  UPPERCASE 
&  CONTROL.  IT  CONSISTS  OF  91  KEYS 

-  PROVIDES  THE  USER  WITH  THE  CAPABILITY  OF  ENTERING  ARABIC 
AND/OR  LATIN  DATA  WITHOUT  CONSTRAINTS 

-  KEYBOARD  MULTIFUNCTION  CAPABILITY  IS  PROVIDED  BY  A  MODE 
SELECTION  KEY  AND  TWO  CHARACTER  SET  SELECTION  KEYS 

-  DATA  IN  EITHER  LANGUAGE  CAN  BE  ENTERED  IN  EITHER  MODE 

-  THE  KEYBOARD  GENERATES  7-BIT  CODES  FOR  LATIN  AND  8-BIT  CODES 
FOR  ARABIC 



ARABIC  CHARACTER  SHAPING 


★  32  BASIC  ARABIC  CHARACTERS  ARE  GENERATED  BY  THE  KEYBOARD 

★  A  CONTEXTUAL  ANALYSIS  OF  THE  ARABIC  DATA  IS  PERFORMED  BY  THE 
CONTROL  PROGRAM  TO  DETERMINE  THE  CORRECT  SHAPE  OF  THE 
CHARACTER  TO  BE  DISPLAYED 


★ IN TOTAL THE TERMINAL CAN DISPLAY 115 SHAPES

EXAMPLE OF SHAPING PROCEDURE

[Figure: the displayed shapes of a word as keys are entered in turn; the keys listed are YAA, TA, KAF, LAM, MIM, SPACE]


DEVICE  SERVICE  ROUTINE  INTERFACE  OVERVIEW 


★  THE  DEVICE  SERVICE  ROUTINE  IS  CONTROL  SOFTWARE  BETWEEN  THE 
USER'S  PROGRAM  AND  THE  VIDEO  DISPLAY  TERMINAL  (VDT) 


[Diagram: blocks labeled VDT OUTPUT, DISPLAY ROM INTERFACE, PROGRAM INTERFACE, ARABIC VDT CONTROLLER, and ARABIC VDT DSR]

System flexibility by software implementation.


BILINGUAL  TERMINAL  DISPLAY  ROM  INTERFACE 


APPENDIX  D 


BCON  BILINGUAL  OPERATING  SYSTEM  BY  ALIS  INC. 


Default Reduced Codes


Code (Hex)    Character

00-7F         Latin characters identical to original ASCII set, with the exception of the following characters:
(illegible)   Function code Set Bilingual Screen Operating Mode (imaged as Latin space)
(illegible)   Function code Set Latin-Only Screen Operating Mode (imaged as Latin space)

80            Numeric(*) space
81            Arabic(**) number sign
82            Numeric multiply sign
83            Arabic ampersand sign
84            Arabic apostrophe sign
85            Numeric percent sign
86            Numeric divide sign
87            Numeric left parenthesis
88            Numeric right parenthesis
89            Numeric plus sign
8A            Numeric minus sign
8B            Numeric less than sign
8C            Numeric equals sign
8D            Numeric greater than sign
8E            Function code Set Arabic Screen Language Mode (imaged as Arabic space)
8F            Function code Set Latin Screen Language Mode (imaged as Arabic space)
90            Function code Set Arabic Line Language Mode (imaged as Arabic space)
91            Function code Set Latin Line Language Mode (imaged as Arabic space)
92            Arabic commercial at sign
93            Arabic left square bracket
94            Arabic right square bracket
95            Arabic upward arrow head
96            Arabic underline
97            Arabic reverse apostrophe
98            Arabic left curly bracket
99            Arabic vertical line
9A            Arabic right curly bracket
9B            Arabic tilde
9C            (reserved)
9D            (reserved)
9E            Function code Set Line Boundary (imaged as Arabic space)
9F            (reserved)

(*) Numeric means character is Arabic but has intrinsic right spacing (i.e., will be considered part of a numeric string).
(**) Arabic means character has intrinsic left spacing.

A3            Arabic multiply sign
A4            Arabic dollar sign
A5            Arabic percent sign
A6            Arabic period
A7            Arabic divide sign
A8            Arabic left parenthesis
A9            Arabic right parenthesis
AA            Arabic asterisk
AB            Arabic plus sign
AC            Arabic comma
AD            Arabic minus sign
AE            Numeric comma
AF            Arabic solidus
B0            Arabic digit 0
B1            Arabic digit 1
B2            Arabic digit 2
B3            Arabic digit 3
B4            Arabic digit 4
B5            Arabic digit 5
B6            Arabic digit 6
B7            Arabic digit 7
B8            Arabic digit 8
B9            Arabic digit 9
BA            Arabic colon
BB            Arabic semicolon
BC            Arabic greater than sign
BD            Arabic equals sign
BE            Arabic less than sign
BF            Arabic question mark


C0            TAIL
C1            KASHIDA
C2            SHADDAH
C3            SUKUN
C4            FATHA
C5            SHADDAH FATHA
C6            FATHATAN
C7            SHADDAH FATHATAN
C8            DAMMAH
C9            SHADDAH DAMMAH
CA            DAMMATAN
CB            SHADDAH DAMMATAN
CC            KASRAH
CD            SHADDAH KASRAH
CE            KASRATAN
CF            SHADDAH KASRATAN
D0            HAMZAH
D1            ALEF
D2            WASLA ON ALEF
D3            HAMZAH ON ALEF
D4            HAMZAH UNDER ALEF
D5            MADDAH ON ALEF
D6            BA A
D7            PEH
D8            TA A MARBUTA
D9            TA A
DA            THA A
DB            JEEM
DC            SHEEM
DD            HA A
DE            KHA A
DF            DAL
E0            THAL
E1            RA
E2            ZAIN
E3            (illegible)
E4            (illegible)
E5            SHEEN
E6            SAD
E7            DAD
E8            TAH
E9            DHAH
EA            AIN
EB            GHAIN
EC            FA
ED            QAF
EE            CAF
EF            GAF
F0            LAM
F1            LAMALEF
F2            WASLA ON LAMALEF
F3            HAMZAH ON LAMALEF
F4            HAMZAH UNDER LAMALEF
F5            MADDAH ON LAMALEF
F6            MEEM
F7            NOON
F8            HA
F9            WAW
FA            HAMZAH ON WAW
FB            ALEF MAQSURA
FC            YA A
FD            HAMZAH ON YA A
FE            Arabic reverse solidus
FF            Blank FF character (imaged as Arabic space)

Key Codes to Reduced Codes Table


Key code(*)   English   Reduced Code   Arabic    Arabic
(ASCII)       Legend    (ARCH)         Legend    Name

20            SPACE     A0             SPACE     Arabic space
21            !         A1             !         Arabic !
22            "         A2             "         Arabic "
23            #         81             #         Arabic #
24            $         A4             $         Arabic $
25            %         A5             %         Arabic %
26            &         83             &         Arabic &
27            '         E8             ط         TAH
28            (         A8             (         Arabic (
29            )         A9             )         Arabic )
2A            *         AA             *         Arabic *
2B            +         AB             +         Arabic +
2C            ,         F9             و         WAW
2D            -         AD             -         Arabic -
2E            .         E2             ز         ZAIN
2F            /         E9             ظ         DHAH
30            0         B0             ٠         Arabic 0
31            1         B1             ١         Arabic 1
32            2         B2             ٢         Arabic 2
33            3         B3             ٣         Arabic 3
34            4         B4             ٤         Arabic 4
35            5         B5             ٥         Arabic 5
36            6         B6             ٦         Arabic 6
37            7         B7             ٧         Arabic 7
38            8         B8             ٨         Arabic 8
39            9         B9             ٩         Arabic 9
3A            :         BA             :         Arabic :
3B            ;         EE             ك         KAF
3C            <         AE             ,         Arabic numeric comma
3D            =         BD             =         Arabic =
3E            >         A6             .         Arabic .
3F            ?         BF             ؟         Arabic ?

(*) Character byte of key code word only (low-order byte). The scan code (high-order byte) is not modified by BCON.


60            `         E0             ذ         THAL
61            a         E5             ش         SHEEN
62            b         F1             لا        LAMALEF
63            c         FA             ؤ         HAMZAH ON WAW


FINAL CODE
CODAR U-F.D.

Recommendation of the Final Meeting Held in Rabat (Morocco) on 22-24 April 1982

FOREWORD


The importance of the role of the information channels in the Arabic world is becoming increasingly obvious in all sectors. All Arabic countries are dealing with various types of information in the fields of administration organization, planning, science and technology.

The simple concept of cooperation between the Arab countries, and the positive results of standardization make it necessary to introduce a unified cipher for the Arabic characters used in the field of information exchange.

In this connection the concerned Arabic organizations have taken considerable measures such as the two meetings which were held in Rabat (Morocco); the first meeting was the (Arab experts conference for the unified Arabic cipher in the field of information). It was held with the cooperation of the (Arabic Institute for Researches and Arabization) during the period between 25th-29th Sept., 1980. The second meeting was concerned with the regulation of the Arabic cipher in its final shape and was held on April 22-24, 1982. In this meeting the technical committee did achieve the projected corrections, and the Arabic cipher which is known as (CODAR U-F.D.) was ready.

Attached are the reasons for modification of the CODAR U-F.D., the recommendations adopted at the meetings and the final shape of the unified Arabic cipher which will be formed in an Arabic standard. This standard will be distributed to the ASMO member bodies for further studying and approval as a prelude to the actual experimentation and application.


RECOMMENDATION 

In the final session and with a group agreement of the conferees on the final shape of the unified Arabic cipher, the following recommendations have been adopted:

(1) The conference requests the Arab League Education Culture and Science Organization (ALECSO) and Arab Organization for Standards and Metrology (ASMO) to adopt the Arabic cipher which has been agreed upon, and take all necessary measures for its adoption and enforcement in all Arabic countries.

(2) The conferees recommend to the information organizations that use Arabic language to experiment with the new cipher before enforcement.

These recommendations shall be submitted in particular to the (Institute for Research and Studies for Arabization) in Morocco, the Saudi Arabian Standards Organization and the National Center for Information in Tunisia for the purpose of testing the new cipher before the next (ASMO) meeting.

(3) It is recommended that the Arabic cipher in its new and final shape be adopted by the Arabic association for telecommunications.

(4) It is also recommended that ALECSO, the ASMO and the Arab association for telecommunication shall make necessary coordination to use Arabic language in the field of information between them and other international organizations, bodies and the UNESCO.

(5) The meeting recommends an emergency session of ALECSO and ASMO to regulate the specifications of the devices, the printing letters and their forms and to find the best way of utilizing computers.

(6) The meeting also recommends the continuous contact between ALECSO and ASMO to see to the best execution of these recommendations.


ARAB STANDARD SPECIFICATIONS

449 

Data  processing  -  7  -  bit  coded  Arabic  Character  set  for  Information  Interchange 


ARAB  LEAGUE 

ARAB  ORGANIZATION  FOR  STANDARDIZATION 
AND  METROLOGY  (  ASMO  ) 


Preface 

This  Arabic  Standard  was  prepared  by  technical  committee  No.  8  (Arabic  characters  in  informatics). 
Among  the  parties  who  participated  in  its  preparation  are  the  Arab  League  Educational,  Cultural, 
and  Scientific  Organization  (ALECSO),  and  the  Institute  of  Studies  and  Research  for  Arabization  in 
Morocco. 

In  accordance  with  the  1982  Directives  for  the  Technical  Work  of  the  Arab  Organization  for 
Standardization  and  Metrology  -  Part  I:  Procedure  and  Working  Methods  -  this  Arabic  Standard  was 
adopted  by  the  resolution  of  the  General  Assembly  of  ASMO  No: 

(  R  342  /  G.A.  /  S  15  -  October  21,  1982  ). 


DATA  PROCESSING:  7-BIT  CODED  ARABIC 
CHARACTER  SET  FOR  INFORMATION 
INTERCHANGE 


0.  INTRODUCTION 

This Arabic Standard specifies the properties of a coded character set using 7-bit binary codes for
information interchange among different types of data processing equipment using the Arabic
characters. It also specifies a set of control and graphic characters, in addition to its coded
representation inspired by ISO 646. The set of specific graphic characters in this standard
enables us under all circumstances to represent Arabic text whether it is totally vowelized,
partially vowelized, or unvowelized. This standard provides the possibilities for information
interchange for special applications, as well as the possibilities for expansion in case of
insufficiency of the coded character set. This Arabic Standard was made in accordance with ISO
646, and the following points were modified so that the standard ISO 646 is convenient for
Arabic usage:

— Table 1.

—  Comments  on  this  table. 

Table 1 was modified so as to permit the use of the coded character set as a
separate group from the Latin character set described in ISO 646 for information interchange,
and the use of basic programs in the Arabic language for the purpose of complete Arabization
when using computers. This table also allows the use of the coded character set together with
the Latin character set as in the International Standard ISO 646, because of the correspondence
between these two standards.

Applying  this  standard  requires  several  application  standards  to  be  implemented  on  a  carrier 
(magnetic  carrier,  transmission  network,  etc.),  and  these  applications  are  specified  in  other 
standards. 

1. SCOPE AND FIELD OF APPLICATION

1.1 This Arabic Standard contains a set of 128 characters (control characters and graphic
characters such as letters, digits and symbols) with their coded representation. Most of
these characters are mandatory and unchangeable, but provision is made for some
flexibility to accommodate special national and other requirements.

1.2  The  need  for  graphics  and  controls  in  data  processing  and  in  data  transmission  has  been  taken 
into  account  in  determining  this  character  set. 

1.3 This Arabic Standard consists of a general table with a number of options, notes, a legend and
explanatory notes.

1.4  This  character  set  is  primarily  intended  for  the  interchange  of  information  among  data 
processing  systems  and  associated  equipment,  and  within  message  transmission  systems. 
1.5 This character set is applicable to all Arabic alphabets.

1.6 This character set includes facilities for extension where its 128 characters are insufficient for
particular applications.

1.7 The definitions of some control characters in this Arabic Standard assume that data associated
with them is to be processed serially in a forward direction. When these characters are included
in strings of data which are processed other than serially in a forward direction, or in data
formatted for fixed record processing, they may have undesirable effects or may require additional
special treatment to ensure that the control characters have their desired effect.

2.  IMPLEMENTATION 

2.1 This character set should be regarded as a basic alphabet in an abstract sense. Its practical use
requires definitions of its implementation in various media. For example, this could include
punched tapes, punched cards, magnetic tapes and transmission channels, thus permitting
interchange of data to take place either indirectly by means of an intermediate recording in a
physical medium, or by local electrical connection of various units (such as input and output
devices and computers) or by means of data transmission equipment.

2.2  The  implementation  of  this  coded  character  set  in  physical  media  and  for  transmission,  taking 
into  account  the  need  for  error  checking,  is  the  subject  of  other  ISO  publications. 


NOTES  ABOUT  TABLE  1: 


1) The format effectors are intended for equipment in which horizontal and vertical movements are
effected separately. If equipment requires the action of CARRIAGE RETURN to be combined
with a vertical movement, the format effector for that vertical movement may be used to effect the
combined movement. For example, if NEW LINE (symbol NL, equivalent to CR+LF) is
required, FE2 shall be used to represent it. This substitution requires agreement between the
sender and the recipient of the data.

The use of these combined functions may be restricted for international transmission on general
switched telecommunication networks (telegraph and telephone networks).

2) The symbols # and ¤ at locations 2/3 and 2/4 are used respectively to denote NUMBER SIGN and
CURRENCY SIGN. Note that the character ¤ does not designate the currency of a specific country
unless otherwise agreed upon between the sender and the recipient of data.
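
Locations such as 2/3 and 2/4 follow the ISO 646 convention of column/row within a table of 8 columns and 16 rows, so the 7-bit code value is column x 16 + row. The arithmetic can be sketched as follows; Python is used only for illustration, and the function name `position_to_code` is ours, not part of the standard.

```python
def position_to_code(position: str) -> int:
    """Convert a 'column/row' table position such as '2/3' to its 7-bit code."""
    column_text, row_text = position.split("/")
    column, row = int(column_text), int(row_text)
    # A 7-bit code table has 8 columns (0-7) of 16 rows (0-15).
    if not (0 <= column <= 7 and 0 <= row <= 15):
        raise ValueError(f"position out of range for a 7-bit code: {position}")
    return column * 16 + row


number_sign = position_to_code("2/3")    # hexadecimal 23
currency_sign = position_to_code("2/4")  # hexadecimal 24
```

The highest position, 7/15, gives 127, which is why the set holds exactly 128 characters.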

3) These positions are intended for national use or for alphabet extension. If not used for such
purposes, they may be used for representing symbols which do not have specific functions. This
requires agreement between the sender and the recipient of the data.

For  the  general  case  of  information  interchange  among  computers,  these  positions  shall  not  be 
used. 

4) Positions and names of special signs which have specific functions in the code table are the same as in
ISO 646. However, such signs should be imaged and printed according to text as shown in the
following Table.
PROGRAM CODE


Program Lexical_Translator (input, output);

(* **********************************************************

   File Name     : Lexical.pas
   Module name   : Lexical_Translator
   Author        : Sadek Saleh AL-Juhaiman
   Date created  : April 4, 1986
   Last change   : Aug 4, 1986

   Calls :
     Open_File         = Gets the source file name, and
                         initializes the output files.
     Initialize        = To initialize the hash table and global
                         variables.
     Fill_Buffer       = Fill the line buffer and increment the
                         line no.
     Buffer_Empty      = Check if the line buffer was consumed.
     Token_And_Type    = Get the next token and its type.
     Map_Iden_To_Latin = Search for the identifier in the
                         symbol table. If not predefined
                         then insert it.
     Latin_Integer     = Map integer tokens to Latin integers.
     Special_Character = Map special characters to Latin
                         equivalent character.
     Control_Char      = Notifies the presence of escape
                         codes.

   Called by     : None
   Include files : Resource.pas

   Variables :
     Line        = Input line buffer.
     Next_Loc    = Points at the first char of next token.
     Token       = Buffer of 255 characters.
     Tok_Type    = Types of the token present in token buffer.
     Tok_Len     = The length of the token in token buffer.
     Line_No     = Source code line number.
     Debug_On    = Boolean variable, debugging feature, set
                   by Arabic directive in the source code.
     Comment_On  = Directive, to include the comments in
                   the generated output.
     Res_Word    = Array of records for the reserved words,
                   contains the Arabic and its English match.
     Match_Ind   = Index in Res_Word array to token location.
     Int_Str     = Integer string of size 10 characters.
     Line        = Input line buffer.
     Next_Loc    = The first character of the next token in
                   the line buffer.
     Token       = Token buffer.
     Latin_Id    = The mapped identifier (in Latin form).
     Hash        = HashTable.
     ArabicSpell = Spelling string array of 5000 chars.
     Characters  = Number of chars in spelling table.
     Line_No     = Counts the read source lines.
     Line_Size   = Line buffer upper limit.
     Match_Ind   = Index of reserved word found in the
                   constant array.
     Iden_No     = The number of the identifier in the
                   sequence of arrival.
     Latin_Char  = One character buffer for special
                   characters.
     Lat_Int     = The integer translation to Latin.
     Error_Set   = Token error set.

   Comment :

     The program will ask for the input source file, with or
     without extension. If the name is valid it will open
     the file and initialize two output files. The two
     files will have the same file name and the extensions DIC
     and PAS. The DIC file has all user-defined identifiers
     with their assigned Id_Numbers. The PAS file will have
     the generated PASCAL code.

     After initialization the program will take one
     line and break it into tokens. Each token is given a
     type, then based on the type a translation module will
     be called.

     The above will continue for each line of code until
     a major error is encountered. A major error will result
     from long tokens when using comments or literal strings.

   ********************************************************** *)


CONST

  Max_Arb_Word = 12;     { size of Arabic word }
  Max_Lat_Word = 12;     { size of Latin word }
  Max_Len      = 255;    { line & literal size }
  Res_Words    = 59;     { reserved words size }
  Max_Key      = 631;    { Prime number, hashing }
  Maxchar      = 5000;   { Size of spelling table }


TYPE

  Line_Range     = 0..Max_Len;
  Arab_Word_Str  = string[Max_Arb_Word];
                                     { max char per Latin word }
  Latn_Word_Str  = string[Max_Lat_Word];

  Word_Rec       = RECORD            { constant array record }
                                     { of reserved words }
                     English : Latn_Word_Str;
                     Arabic  : Arab_Word_Str;
                   END;

  Reserved_Index = 1 .. Res_Words;
  Words          = array [Reserved_Index] OF Word_Rec;
  Latin_Token    = string[6];        { string in the form id_000 }
  WordPointer    = ^WordRecord;      { Pointer to user defined id }
  WordRecord     = RECORD            { for user defined iden. }
                     Index,          { identifier number sequence }
                     Lenth,          { Length of the word. }
                                     { Location of the word's last }
                                     { character in symbol table. }
                     LastChar : integer;
                                     { link pointer to next word }
                     NextWord : WordPointer;
                                     { assigned identifier number }
                     Latin_Id : Latin_Token;
                   END;

  HashTable      = array [1 .. Max_Key] OF WordPointer;
  SpellingTable  = array [1 .. Maxchar] OF char;
  Ln_Str         = string[Max_Len];
  Token_Str      = string[Max_Len];

  Errors         = (Long_Token, Long_Comment,
                    Long_Literal_Str, Illegal_Char);

  Types_Of_Token = (Blanks, Illegal, Reserved_Word,
                    Literal_Str, Contrl_Cod, Unclsfd,
                    Identifier, Coment, Integer1,
                    Funct_Operator);

                                     { Arabic characters range }

resource  file  contains  the 


CONST

  { The Arabic spellings, and one English entry in the final group, }
  { are illegible in the scanned listing; they are shown as '...'.  }

  res_word : words =
  (
    (english: 'absolute';  arabic: '...'),
    (english: 'and';       arabic: '...'),
    (english: 'array';     arabic: '...'),
    (english: 'begin';     arabic: '...'),
    (english: 'case';      arabic: '...'),
    (english: 'const';     arabic: '...'),
    (english: 'div';       arabic: '...'),
    (english: 'do';        arabic: '...'),
    (english: 'downto';    arabic: '...'),
    (english: 'else';      arabic: '...'),
    (english: 'end';       arabic: '...'),
    (english: 'external';  arabic: '...'),
    (english: 'text';      arabic: '...'),
    (english: 'forward';   arabic: '...'),
    (english: 'for';       arabic: '...'),
    (english: 'function';  arabic: '...'),
    (english: 'goto';      arabic: '...'),
    (english: 'concat';    arabic: '...'),
    (english: 'inline';    arabic: '...'),
    (english: 'if';        arabic: '...'),
    (english: 'in';        arabic: '...'),
    (english: 'label';     arabic: '...'),
    (english: 'mod';       arabic: '...'),
    (english: 'nil';       arabic: '...'),
    (english: 'not';       arabic: '...'),
    (english: 'overlay';   arabic: '...'),
    (english: 'of';        arabic: '...'),
    (english: 'or';        arabic: '...'),
    (english: 'packed';    arabic: '...'),
    (english: 'procedure'; arabic: '...'),
    (english: 'program';   arabic: '...'),
    (english: 'record';    arabic: '...'),
    (english: 'repeat';    arabic: '...'),
    (english: 'set';       arabic: '...'),
    (english: 'begin';     arabic: '...'),
    (english: 'shl';       arabic: '...'),
    (english: 'real';      arabic: '...'),
    (english: 'integer';   arabic: '...'),
    (english: 'boolean';   arabic: '...'),
    (english: 'read';      arabic: '...'),
    (english: 'readln';    arabic: '...'),
    (english: 'write';     arabic: '...'),
    (english: 'writeln';   arabic: '...'),
    (english: 'end';       arabic: '...'),
    (english: 'shr';       arabic: '...'),
    (english: 'string';    arabic: '...'),
    (english: 'then';      arabic: '...'),
    (english: 'type';      arabic: '...'),
    (english: 'to';        arabic: '...'),
    (english: 'until';     arabic: '...'),
    (english: 'var';       arabic: '...'),
    (english: 'str';       arabic: '...'),
    (english: 'chr';       arabic: '...'),
    (english: 'ord';       arabic: '...'),
    (english: 'while';     arabic: '...'),
    (english: 'input';     arabic: '...'),
    (english: 'output';    arabic: '...'),
    (english: 'with';      arabic: '...'),
    (english: '...';       arabic: '...')
  );


  Arabic_Alph : Arbic_Alph =
    [ $B0 .. $B9,    { Arabic digits }
      $D0 .. $FD,    { Arabic letters }
      $96,           { underscore }
      $C0 ];         { tail generation }

  Delimiters : SET OF char =     { const set, delimiters }
    [ #$8E,          { BCON function code }
      #$8F,          { BCON function code }
      #$90,          { BCON function code }
      #$91,          { BCON function code }
      #$93,          { Array left square bracket }
      #$20,          { Latin space }
      #$94,          { Array right square bracket }
      #$95,          { Arabic up arrow "pointer" }
      #$97,          { Arabic reverse apostrophe }
      #$A0,          { Arabic space }
      #$A3,          { Arabic multiply }
      #$A6,          { Arabic period }
      #$A7,          { Arabic divide }
      #$A8,          { Arabic left parenthesis }
      #$A9,          { Arabic right parenthesis }
      #$AB,          { Arabic plus sign }
      #$AC,          { Arabic comma }
      #$AD,          { Arabic minus }
      #$AE,          { numeric comma used as }
                     { the Latin decimal dgt }
      #$BA,          { Arabic colon }
      #$BC,          { Arabic greater than }
      #$BD,          { Arabic equal sign }
      #$BE ];        { Arabic less than }


VAR

  Debug_On    : boolean;
  Comment_On  : boolean;
  Tok_Type    : Types_Of_Token;
  Tok_Len     : Line_Range;
  Int_Str     : string[10];
  I           : integer;
  Line        : ln_Str;
  Next_Loc    : Line_Range;
  token       : token_Str;
  Latin_Id    : Latin_Token;
  Hash        : HashTable;
  ArabicSpell : SpellingTable;
  Characters  : integer;
  Line_No     : integer;
  Line_Size   : Line_Range;
  Iden_No     : 000 .. 999;
  Match_Ind   : Reserved_Index;
  Latin_Char  : char;
  Lat_Int     : string[10];
  Error_Set   : SET OF errors;
  OutFile     : text;
  InFile      : text;
  Dictionary  : text;


Procedure OPEN_FILE;

VAR
  valid     : boolean;       { for I/O error w/ file name }
  F_Name,                    { file name with no extension }
  File_Name : string[12];    { file name from keyboard }
  ind       : integer;

BEGIN
  valid := false;
  WRITELN ('Input_File name:');
  REPEAT                               { until valid file name }
    READLN (File_Name);
    ASSIGN (InFile, File_Name);
    {$I-}                              { if no error opening file }
    RESET (InFile);                    { then file exists }
    {$I+}                              { if no I/O error, it's valid }
    valid := (IOresult = 0);
    ClrScr;
    IF not (valid) THEN
    BEGIN
      WRITELN (' ** FAILURE TO OPEN FILE ==> ', File_Name);
      WRITELN (' Please RE_ENTER Input_File name');
    END;
  UNTIL valid;

  ind := 1;                            { start at the first character }
  REPEAT                               { get the name w/o extension }
    F_Name(.ind.) := File_Name(.ind.);
    ind := ind + 1;
  UNTIL (File_Name(.ind.) = ' ') OR
        (File_Name(.ind.) = '.') OR
        (ind > LENGTH (File_Name));
  F_Name(.0.) := CHR (ind - 1);

  ASSIGN (outfile, F_Name + '.pas');     { translator output }
  ASSIGN (dictionary, F_Name + '.dic');  { dictionary file }
  RESET (infile);
  REWRITE (outfile);
  REWRITE (dictionary);                { file contains identifiers }
END;                                   { and their translations }
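
The naming rule applied by OPEN_FILE (cut the typed name at the first blank or period, then append the PAS and DIC extensions) can be sketched outside Pascal as below. This is an illustrative Python sketch only; the function name `output_names` is hypothetical and does not appear in the thesis.

```python
from typing import Tuple


def output_names(file_name: str) -> Tuple[str, str]:
    """Return the (.pas, .dic) names for a source name typed with or without extension."""
    stem = file_name
    # Like the REPEAT loop in OPEN_FILE, stop at the first blank or period.
    for stop in (" ", "."):
        cut = stem.find(stop)
        if cut != -1:
            stem = stem[:cut]
    return stem + ".pas", stem + ".dic"


pas_name, dic_name = output_names("lexical.arb")
```

Typing the name with or without an extension gives the same pair of output names, which matches the behaviour described in the program header.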


Procedure INITIALIZE;        { Initialize the hash keys }

VAR                          { and the global variables }
  KeyNo : integer;

BEGIN
  Debug_On   := false;
  Comment_On := false;
  Error_Set  := [];
  Line_No    := 0;
  Iden_No    := 0;
  KeyNo      := 1;
  WHILE KeyNo <= Max_Key DO
  BEGIN
    hash(.KeyNo.) := nil;
    KeyNo := KeyNo + 1;
  END;
  characters := 0;           { count of chars in spell tbl }
END;                         { initialize }

PROCEDURE FILL_BUFFER
  ( VAR line    : ln_Str;        { input line buffer }
    VAR where   : line_range;    { location in buffer }
    VAR line_no : integer;
    VAR Ln_Size : line_range
  );

{ The Arabic directive strings compared below are illegible in the }
{ scanned listing and are shown as '...'.                           }

BEGIN
  READLN (infile, line);
  Line_No := Line_No + 1;
  IF Debug_On THEN WRITELN (line);

  IF (line = '...') THEN         { set comment directive }
  BEGIN
    Comment_On := true;
    READLN (infile, line);
    line_No := Line_No + 1;
  END;

  IF line = '...' THEN
  BEGIN                          { reset comment directive }
    Comment_On := false;
    READLN (infile, line);
    line_No := Line_No + 1;
  END;

  IF line = '...' THEN           { set debug directive }
  BEGIN
    Debug_On := true;
    READLN (infile, line);
    line_No := Line_No + 1;
  END;

  IF line = '...' THEN           { reset debug directive }
  BEGIN

FUNCTION BUFFER_EMPTY
  ( Next_Loc : line_range;
    Ln_Size  : line_range
  ) : BOOLEAN;

BEGIN                            { check if buffer is empty }
  BUFFER_EMPTY := (next_loc > Ln_Size);
END;


FUNCTION EMPTY_ERROR_SET : BOOLEAN;

{ ****************************************************** }
{ If error set is empty then no errors are found         }
{ yet; translation will continue.                        }
{ ****************************************************** }

BEGIN
  EMPTY_ERROR_SET := (ERROR_SET = []);
END;


Procedure TOKEN_AND_TYPE
  ( VAR where     : line_range;      { location of next token }
    VAR token     : token_Str;
    VAR Tok_Len   : line_range;      { length of resulted token }
    VAR Tok_Type  : Types_Of_Token;  { Token type }
    VAR Match_Ind : reserved_index   { index of res. words }
  );

(* ******************************************************

   module name  : TOKEN_AND_TYPE
   date created : April 7, 1986
   calls        : Blanks, Comments, Literal_String,
                  Integer_Tok, Identifier_Tok,
                  Reserved_Tok, Special_Char,
                  Control_Char
   called by    : MAIN
   variables    :
   last change  : Aug 5, 1986
   Comment      :
     The procedure collects the tokens and assigns
     token type names to them.

   ****************************************************** *)

VAR index : integer;             { For token indexing }
    ch    : char;                { special characters token }

CONST digits : SET OF char = [#$B0 .. #$B9];


Procedure COMMENT;

{ ****************************************************** }
{ Procedure comment will assign the matching Latin       }
{ brackets and the body of the comment to the token.     }
{ The token type is then set to Comment.                 }
{ ****************************************************** }

BEGIN
  token[1] := '(';               { assign the opening bracket }
  token[2] := '*';               { and asterisk to token }
  index := 2;                    { start of comment body }
  where := where + 2;
  REPEAT                         { assign body of comment }
    index := index + 1;          { pointer of token buffer }
    token[index] := line[where];
    where := where + 1;          { pointer of line buffer }
  UNTIL ((ORD (line[where]) = $AA) AND
         (ORD (line[where+1]) = $A8)) OR
        (where >= Line_Size);

  IF (where >= Line_Size) THEN
  BEGIN                          { The end of line is reached }
    Tok_Type := Illegal;         { before closing the comment }
    Error_Set := Error_Set + [Long_Comment];
  END
  ELSE                           { the comment is valid }
  BEGIN
    token[index+1] := '*';       { assign the closing bracket }
    token[index+2] := ')';
    Tok_Type := coment;
    where := where + 2;          { advance line pointer }
    Tok_Len := index + 2;        { advance token pointer }
    token[0] := chr (Tok_Len);   { set token length }
  END;
END;                             { COMMENT }


PROCEDURE LITERAL_STRING;

{ ****************************************************** }
{ Literal string will look for single and double quotes, }
{ matching the quote character at the beginning and the  }
{ end of the string, then assigning the Latin quotation  }
{ marks.                                                 }
{ ****************************************************** }

BEGIN
  index := 0;
  CASE ORD (line[where]) OF      { if buffer points at: }

    $97 : REPEAT                 { single quotes }
            index := index + 1;
            token[index] := line[where];
            where := where + 1;
          UNTIL (ORD (line[where]) = $97) OR
                (where > line_size);

    $A1 : REPEAT                 { double quotes }
            index := index + 1;
            token[index] := line[where];
            where := where + 1;
          UNTIL (ORD (line[where]) = $A2) OR
                (where > line_size);
  END;                           { CASE }

                                 { if literal ended with }
                                 { the right quote mark }
  IF (ORD (line(.where.)) = $A2) OR
     (ORD (line(.where.)) = $97) THEN
  BEGIN
    index := index + 1;          { advance pointer for the }
    Tok_Len := index;            { quote mark. Set length. }
    Tok_Type := Literal_Str;
    token[0] := chr (Tok_Len);

                                 { for single quote literal }
    IF (ORD (token[1]) = $97) THEN
    BEGIN                        { assign single quotes }
      token[1] := chr ($27);
      token[index] := chr ($27);
    END;

    IF (ORD (line[where]) = $A2) THEN
    BEGIN                        { assign double quotes }
      token[1] := chr ($22);
      token[index] := chr ($22);
    END;

    where := where + 1           { point to the next token }
  END
  ELSE
  BEGIN                          { if line pointer did not find }
                                 { single/double quote = err }
    Error_Set := Error_Set + [Long_Literal_Str];
    Tok_Type := illegal;
    Tok_Len := index;
    token[0] := chr (index);     { set length of token }
  END;
END;                             { LITERAL_STRING }


PROCEDURE INTEGER_TOK;

{ ****************************************************** }
{ The procedure will return the digits ranging           }
{ from B0 .. B9 hex.                                     }
{ ****************************************************** }

BEGIN
  index := 0;
  WHILE (line(.where.) IN digits) DO
  BEGIN
    index := index + 1;
    token(.index.) := line(.where.);
    where := where + 1;
  END;
  Tok_Type := integer1;
  Tok_Len := index;
  token[0] := chr (index);
END;

Procedure IDENTIFIER_TOK;

{ ****************************************************** }
{ The procedure will look for any number of digits and   }
{ underscore characters following the first letter.      }
{ ****************************************************** }

VAR valid : boolean;

BEGIN
  index := 0;
  REPEAT
    index := index + 1;
    token(.index.) := line(.where.);
    where := where + 1;
  UNTIL not (ORD (line(.where.)) IN Arabic_alph);
  Tok_Type := Identifier;
  Tok_Len := index;
  token[0] := chr (index);
END;
Procedure RESERVED_TOK
  ( VAR match_index : reserved_index );

{ ****************************************************** }
{ If the TOKEN is a reserved word, the procedure will    }
{ set the token type to Reserved_Tok and pass the        }
{ index of the word in the constant array.               }
{ ****************************************************** }

VAR index : integer;
    hit   : boolean;             { when a match is found }

BEGIN
  hit := false;
  index := 1;
  WHILE (index <= res_words) AND (not (hit)) DO
  BEGIN
    IF (token = res_word(.index.).Arabic) THEN
    BEGIN                        { the token matches a }
      hit := true;               { reserved word }
      match_index := index;
    END;
    index := index + 1;
  END;                           { while no hit }
  IF hit THEN                    { if token is reserved word }
    Tok_Type := Reserved_word;   { set the token type }
END;
Procedure SPECIAL_CHAR_TOK;

{ ****************************************************** }
{ The procedure gets all the tokens of one char          }
{ other than the escape codes.                           }
{ ****************************************************** }

VAR Illegal_Chars : SET OF $21..$FF;

BEGIN
  Illegal_Chars := [ $21..$7E,   { Latin chars }
                     $81..$8D,   { numeric characters }
                     $92,        { Arabic © character }
                     $97, $99,
                     $9B..$9F,   { non used characters }
                     $A1..$A2,
                     $A4..$A5,
                     $AA, $AF,
                     $BF,
                     $C0..$CF ]; { Arabic diacritic }

  IF ord (line(.where.)) IN Illegal_Chars THEN
  BEGIN                          { Latin characters }
    Tok_Type := illegal;
    error_set := error_set + [Illegal_Char];
  END
  ELSE
  BEGIN
    token[1] := line[where];     { one character special char }
    where := where + 1;          { advance line pointer }
    Tok_Len := 1;
    token(.0.) := chr (1);       { set token length to one }
    Tok_Type := Funct_Operator   { set token type }
  END;
END;

Procedure CONTROL_CHARS;

{ ****************************************************** }
{ Control characters are used by BCON and will be        }
{ omitted.                                               }
{ ****************************************************** }

BEGIN
  token[1] := line[where];
  Tok_Type := contrl_cod;
  Tok_Len := 1;
  IF Debug_On THEN
  BEGIN
    WRITELN (' Control character (', ORD (line[where]),
             ') in source code');
    WRITELN (' IN Line Number ', Line_No,
             ' Location = ', where);
  END;
  where := where + 1;
END;


BEGIN                            { TOKEN AND TYPE }

  { Based on the first character of the token, call an     }
  { appropriate module to collect the token and set the    }
  { type.                                                  }

  Tok_Type := unclsfd;           { initialize token type }

  IF (ORD (line[where]) = $A9) AND      { $A9 opening bracket }
     (ORD (line[where+1]) = $AA) THEN   { $AA is asterisk }
    COMMENT;                     { call procedure Comment }

  IF Tok_Type <> coment THEN     { if not comment THEN based }
    CASE ORD (line[where]) OF    { on first char get the type }

      $A0, $20 : BLANK;          { leading space(s) }

      $A2, $97 : LITERAL_STRING;

      $B0..$B9 : INTEGER_TOK;    { get integer token }

      $D0..$FD : BEGIN           { leading letter }
                   IDENTIFIER_TOK;      { is it user defined / }
                                        { reserved? }
                   RESERVED_TOK (match_ind);
                 END;

      $80, $8E,
      $8F, $90,
      $91      : CONTROL_CHARS;  { control characters }

      ELSE       SPECIAL_CHAR_TOK;

    END;                         { case }
END;
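
The first-byte dispatch of TOKEN_AND_TYPE can be illustrated with a small sketch. The code below is Python rather than the thesis's Turbo Pascal, covers only the ranges that read clearly in the listing (spaces, digits, letters, and the comment opening), and uses names of our own choosing (`classify`, `to_latin_digits`). The digit translation assumes that $B0 encodes zero and that the ten codes run in order, which the listing suggests but does not state.

```python
ARABIC_DIGITS = range(0xB0, 0xBA)    # $B0..$B9
ARABIC_LETTERS = range(0xD0, 0xFE)   # $D0..$FD
SPACES = {0x20, 0xA0}                # Latin space and Arabic space
COMMENT_OPEN = (0xA9, 0xAA)          # Arabic opening bracket, then asterisk


def classify(line: bytes, where: int = 0) -> str:
    """Name the kind of token starting at line[where], judged by its first byte."""
    first = line[where]
    if tuple(line[where:where + 2]) == COMMENT_OPEN:
        return "comment"
    if first in SPACES:
        return "blank"
    if first in ARABIC_DIGITS:
        return "integer"
    if first in ARABIC_LETTERS:
        return "identifier or reserved word"
    return "other"


def to_latin_digits(token: bytes) -> str:
    """Map a run of Arabic digit codes to Latin digits (assumes $B0 is zero)."""
    if not all(code in ARABIC_DIGITS for code in token):
        raise ValueError("token contains a non-digit code")
    return "".join(chr(code - 0xB0 + ord("0")) for code in token)
```

For example, the four codes $B1 $B9 $B8 $B6 would be classified as an integer token and, under the stated assumption, translated to the Latin string 1986.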




Procedure MAP_IDEN_TO_LATIN
  ( token        : Token_Str;
    lenth        : integer;
    VAR Latin_Id : Latin_Token
  );

(* ******************************************************

   module name  : Map_Iden_To_Latin
   date created : April 30, 1986
   calls        : SEARCH
   called by    : MAIN
   variables    :
     token    = scanned identifier token.
     lenth    = length of scanned identifier.
     Latin_Id = the translated identifier.
   last change  : Aug 2, 1986

   Comment :

     The procedure will look up an Arabic identifier; if it is
     not in the list, it will insert the Arabic token in the list.
     The token will be assigned a Latin label for the use of
     the PASCAL compiler. The meaningless label will have the
     form Id_###, where ### is an integer.

     Note: code segments of this module are taken from
     "BRINCH HANSEN ON PASCAL COMPILERS", 1985;
     see thesis references.

   ****************************************************** *)

Function Hash_Key                { return the hash key of }
  ( token : token_Str;           { the identifiers. }
    lenth : line_range
  ) : integer;

CONST W = 32513;                 { 32768 - 255, overflow check }
      N = Max_Key;               { Prime number for words size }

VAR sum, i : integer;            { sum is the token ord. value }

BEGIN
  sum := 0;
  i := 1;
  WHILE i <= lenth DO
  BEGIN
    sum := (sum + ORD (token(.i.))) MOD W;
    i := i + 1;
  END;
  Hash_Key := (sum MOD N) + 1;
END;

Procedure INSERT
  ( token : token_Str;
    lenth : line_range;
    index : integer;
    KeyNo : integer
  );

VAR m, n    : integer;
    pointer : wordpointer;
    temp    : Latin_token;

  PROCEDURE ID_NO (VAR Latin_id : Latin_token);

  VAR TEMP : string[3];

  BEGIN
    CASE IDEN_NO OF
      0..9     : BEGIN
                   STR (Iden_No:1, TEMP);
                   Latin_id := CONCAT ('id_', TEMP);
                 END;
      10..99   : BEGIN
                   STR (Iden_No:2, TEMP);
                   Latin_id := CONCAT ('id_', TEMP);
                 END;
      100..999 : BEGIN
                   STR (Iden_No:3, TEMP);
                   Latin_id := CONCAT ('id_', TEMP);
                 END;
    END;
  END;

BEGIN
                                 { insert identifier in }
                                 { spelling table }
  characters := characters + lenth;
  m := lenth;
  n := characters - m;
  WHILE (m > 0) DO
  BEGIN
    ArabicSpell(.m+n.) := token(.m.);
    m := m - 1;
  END;

  ID_NO (temp);
  NEW (pointer);                 { insert word record into }
  pointer^.Latin_Id := temp;     { the hash table }
  pointer^.NextWord := Hash(.KeyNo.);
  pointer^.Index    := index;
  pointer^.lenth    := lenth;
  pointer^.lastchar := characters;
  WRITELN (dictionary, ' ',
           pointer^.Latin_Id, ' ', token);
  Hash(.KeyNo.) := pointer;
END;


FUNCTION FOUND
  ( token   : token_Str;
    lenth   : integer;
    pointer : WordPointer
  ) : boolean;

VAR same : boolean;
    m, n : integer;

BEGIN
  IF pointer^.lenth <> lenth THEN
    same := false
  ELSE
  BEGIN
    same := true;
    m := lenth;
    n := pointer^.lastchar - m;
    WHILE same AND (m > 0) DO
    BEGIN
      same := token(.m.) = ArabicSpell(.m+n.);
      m := m - 1;
    END;
  END;
  FOUND := same;
END;
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Procedure 

Search 

(  token  :  token_Str; 

lenth  :  integer;  {  token  length  } 

VAR  Latin_Id  :  Latin_token  I  returned  Latin  token] 

)  ; 

{ ************************************************** }
{ Comment:

  The module calls function Hash_Key to get the token key
  and then looks the key up in a hash table.

  The hash table holds pointers to word records. Each record
  has the length of the token, its location in the symbol
  table, the Latin identifier number, and the next word in
  the linked list.

  If the pointer obtained from the key number is nil, the
  word is not in the table and must be inserted if there is
  room in the spelling table. Insertion is made by procedure
  INSERT. If the pointer points at a record, or a linked list
  of records, function FOUND is called to verify the spelling.
}
{ ************************************************** }


VAR  KeyNo   : integer;           { global variables for SEARCH }
     done    : boolean;
     Pointer : wordpointer;


BEGIN { SEARCH }
  KeyNo   := Hash_Key (token, lenth);
  pointer := hash(.KeyNo.);
  done    := false;

  WHILE not(done) DO
                                  { insert new id. if size and }
    IF ( pointer = nil ) THEN     { number within limits }
    BEGIN                         { add identifier }
      Iden_No := Iden_No + 1;
      INSERT (token, lenth, Iden_No, KeyNo);
      Latin_Id := hash(.KeyNo.)^.Latin_Id;
      done := true;
    END
    ELSE IF FOUND (token, Tok_Len, pointer) THEN
    BEGIN
      Latin_Id := pointer^.Latin_Id;
      done := true;
    END
    ELSE
      pointer := pointer^.NextWord;
END; { SEARCH }


BEGIN { Map_Iden_To_Latin }
  SEARCH (Token, Tok_Len, Latin_Id);
END; { MAP_IDENTIFIER_TO_LATIN }
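
The lookup that SEARCH, INSERT and FOUND perform together can be condensed into a small stand-alone program. This is only a sketch under stated assumptions, not the thesis code: TableSize, HashKey, Lookup, WordRec and the additive hash function are hypothetical names and choices, and each record stores its own spelling instead of an offset into a shared spelling table such as ArabicSpell.

```pascal
program HashSketch;
{ Illustrative sketch only. All names and the hash function are   }
{ assumptions; they are not taken from the original listing.      }
const
  TableSize = 31;
type
  TokenStr = string[30];
  WordPtr  = ^WordRec;
  WordRec  = record
               Spelling : TokenStr;   { identifier as typed by the user }
               IdNo     : integer;    { number N of the generated id_N  }
               Next     : WordPtr;    { next record in the same bucket  }
             end;
var
  Table  : array[0..TableSize - 1] of WordPtr;
  LastId : integer;
  i      : integer;

{ Hypothetical hash: sum of character codes modulo the table size. }
function HashKey(s : TokenStr) : integer;
var
  k, sum : integer;
begin
  sum := 0;
  for k := 1 to length(s) do
    sum := (sum + ord(s[k])) mod TableSize;
  HashKey := sum;
end;

{ Return the id number of s, inserting a new record when s is unseen. }
function Lookup(s : TokenStr) : integer;
var
  key   : integer;
  p     : WordPtr;
  found : boolean;
begin
  key := HashKey(s);
  p := Table[key];
  found := false;
  { Walk the chain; the flag avoids dereferencing a nil pointer. }
  while (p <> nil) and not found do
    if p^.Spelling = s then
      found := true
    else
      p := p^.Next;
  if not found then
  begin
    LastId := LastId + 1;
    new(p);
    p^.Spelling := s;
    p^.IdNo := LastId;
    p^.Next := Table[key];            { link in at the head of the chain }
    Table[key] := p;
  end;
  Lookup := p^.IdNo;
end;

begin
  for i := 0 to TableSize - 1 do
    Table[i] := nil;
  LastId := 0;
  writeln('id_', Lookup('alpha'));    { first new identifier:  id_1 }
  writeln('id_', Lookup('beta'));     { second new identifier: id_2 }
  writeln('id_', Lookup('alpha'));    { already stored:        id_1 }
end.
```

As in the listing above, a new record is linked in at the head of its bucket, so a repeated identifier always resolves to the number it was given on first sight.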


PROCEDURE GET_LATIN_SPEC_CHAR ( token          : token_Str;
                                VAR Latin_char : char
                              );

VAR Arb_char : string[1];
BEGIN
  Arb_Char := token(.1.);

  CASE ORD (Arb_Char) OF
    $BC : Latin_char := '>';      { Arabic greater than }
    $BE : Latin_char := '<';      { Arabic less than }
    $93 : Latin_char := ']';      { Arabic square bracket }
    $94 : Latin_char := '[';
    $A8 : Latin_char := ')';      { Arabic RIGHT parenthesis }
    $A9 : Latin_char := '(';      { Arabic LEFT parenthesis }
    $AB : Latin_char := '+';      { Arabic Plus }
    $AD : Latin_char := '-';      { Minus }
    $A7 : Latin_char := '/';      { Divide }
    $96 : Latin_char := '_';      { Under_Score }
    $A3 : Latin_char := '*';      { Multiply }
    $BA : Latin_char := ':';      { Colon }
    $BD : Latin_char := '=';      { Equal }
    $AE : Latin_char := ...;      { Numeric comma }
    $95 : Latin_char := '^';      { Hat }
    $A6 : Latin_char := '.';      { Period }
    $BB : Latin_char := ';';      { Semicolon }
    $AC : Latin_char := ',';      { Comma }
  END
END;

PROCEDURE LATIN_INT ( token       : token_Str;
                      Tok_Len     : line_range;
                      VAR Lat_Int : Str10
                    );

VAR ind : integer;

BEGIN                             { for each digit map to }
                                  { Latin digit }
  FOR ind := 1 TO Tok_Len DO
    CASE ORD (token(.ind.)) OF
      $B0 : Lat_Int(.ind.) := '0';
      $B1 : Lat_Int(.ind.) := '1';
      $B2 : Lat_Int(.ind.) := '2';
      $B3 : Lat_Int(.ind.) := '3';
      $B4 : Lat_Int(.ind.) := '4';
      $B5 : Lat_Int(.ind.) := '5';
      $B6 : Lat_Int(.ind.) := '6';
      $B7 : Lat_Int(.ind.) := '7';
      $B8 : Lat_Int(.ind.) := '8';
      $B9 : Lat_Int(.ind.) := '9';
    END;

  Lat_Int(.0.) := token(.0.);     { set length of token }
END;
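
Because the ten CASE labels in LATIN_INT run consecutively from $B0 to $B9, the same digit mapping can also be written as an arithmetic offset. The following is a minimal sketch, assuming the code values are as listed above; ArabicZero, MapDigits and the sample input are illustrative names and are not part of the original program.

```pascal
program DigitSketch;
{ Illustrative sketch only. ArabicZero and MapDigits are assumed   }
{ names; the code range $B0..$B9 is taken from the listing above.  }
const
  ArabicZero = $B0;                   { code of the Arabic digit zero }
type
  Str10 = string[10];
var
  src, dst : Str10;

{ Copy token into latin, replacing each Arabic digit code by the  }
{ Latin digit at the same offset from '0'. Other characters are    }
{ copied unchanged.                                                }
procedure MapDigits(token : Str10; var latin : Str10);
var
  i, code : integer;
begin
  latin := token;                     { also copies the length byte }
  for i := 1 to length(token) do
  begin
    code := ord(token[i]);
    if (code >= ArabicZero) and (code <= ArabicZero + 9) then
      latin[i] := chr(ord('0') + code - ArabicZero);
  end;
end;

begin
  { Four Arabic digit codes standing for 1, 9, 8 and 5. }
  src := chr($B1) + chr($B9) + chr($B8) + chr($B5);
  MapDigits(src, dst);
  writeln(dst);                       { prints 1985 }
end.
```

The offset form depends on the digit codes being contiguous; the explicit CASE table in LATIN_INT would still work if a different code page scattered them.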

PROCEDURE PRINT_ERROR_MESSAGES;
VAR ind : integer;

BEGIN
  WRITELN ('*** ERROR ON LINE NO. ', line_no);
  FOR ind := 1 TO line_size DO write (line(.ind.));
  WRITELN;

  IF long_token IN error_set THEN
    WRITELN (' has long token ***', token);

  { IF long_comment IN error_set THEN
      WRITELN (' has long comment***', token); }

  IF long_literal_Str IN error_set THEN
    WRITELN (' UNCLOSED QUOTES ');

  IF Illegal_Char IN error_set THEN
    WRITELN ('========== Character number ', Next_Loc,
             ' is out of range==========');
END;


{ main }

BEGIN
  OPEN_FILE;
  INITIALIZE;

  WHILE not(eof(infile)) AND (error_set = []) DO
  BEGIN { Line process }
    FILL_BUFFER (line, next_loc, line_no, line_size);

    WHILE not (BUFFER_EMPTY (next_loc, line_size)) AND
          (error_set = []) DO
    BEGIN { Token process }
      TOKEN_AND_TYPE (next_loc, token, Tok_Len,
                      Tok_Type, Match_Ind);

      IF Debug_On THEN
        WRITELN ('token = ', token, ' token length = ',
                 Tok_Len, ' Next_Loc = ', Next_Loc);

      CASE Tok_Type OF
        blanks         : FOR i := 1 TO Tok_Len DO
                           write (outfile, ' ');
        coment         : IF Comment_On THEN
                           write (outfile, token);
        literal_Str    : write (outfile, token);
        reserved_word  : write (outfile,
                                res_word(.match_ind.).English);
        identifier     : IF (Iden_No < 1000) AND
                            (characters < Max_Char) THEN
                         BEGIN
                           MAP_IDEN_TO_LATIN (token, Tok_Len, Latin_Id);
                           write (outfile, Latin_Id);
                         END;
        integer1       : BEGIN
                           LATIN_INT (token, Tok_Len, Lat_Int);
                           write (outfile, Lat_Int);
                         END;
        funct_operator : BEGIN
                           GET_LATIN_SPEC_CHAR (token, Latin_char);
                           write (outfile, Latin_char);
                         END;
        contrl_cod     : WRITELN ('line ', line_no:4,
                                  ' Control code was ignored');
        illegal        : BEGIN
                           PRINT_ERROR_MESSAGES;
                         END;
      END; { CASE }
    END; { WHILE TOKEN }
    WRITELN (outfile);
  END;

  IF error_set <> [] THEN WRITELN ('error on token type');
  CLOSE (outfile);
END.
Source Code


[Dictionary entries id_1 through id_8 with their Arabic spellings; Arabic text not recoverable]

Test Run 1 Dictionary Table


program id_1;

const id_2 = 32;

type  id_3 = record
               id_4 : string [30];
               id_5 : integer;
               id_6 : boolean;
             end;

var   id_7 : array [1..id_2] of id_3;
      id_8 : integer;

begin
  while (id_8 < 32) do
  begin
    id_8 := id_8 + 1;
    read  (id_7[id_8].id_4, id_7[id_8].id_5);
    write (id_7[id_8].id_4, id_7[id_8].id_5);
  end;
end.


Generated  Code 



Test  Run  2 


program id_1;

const id_2 = 32;
type  id_3 = record
               id_4 : string [30];
               id_5 : integer;
               id_6 : boolean;
             end;

var   id_7 : array [1..id_2] of id_3;
      id_8 : integer;

begin
  while (id_8 < 32) do
  begin
    id_8 := id_8 + 1;
    read  (id_7[id_8].id_4, id_7[id_8].id_5);
    write (id_7[id_8].id_4, id_7[id_8].id_5)
  end;
end.


Generated  Code 


Test  Run  3 
Source Code



Test  Run  3 


program id_1 (input, output);
const id_2 = 15;
var   id_3 : string[80];
      id_4 : string[12];
      id_5 : real;


begin
  id_5 := 1-22.7 * id_4;
i  d_3:  —  '  0  I  J I  J 

i  d_4:  =  "  £YT\  *oT  '  j 
concat  (  id_3,id_4); 
writeln (id_3,id_4);


end . 


Generated  Code 


[Dictionary entries id_1 through id_4 with their Arabic spellings; Arabic text not recoverable]
Proceedings of the International Symposium for
Standardization of Codes, Character Sets and Keyboards
for the Arab Language in Computers, 1-4 June 1980 in
Riyadh, Saudi Arabia, Saudi Arabian Standard
Organization, 1984.

BOON Programmer's Manual, Arabic-Latin Information
Systems, Inc., Montreal, Canada, 1985.

Hansen, Per B., Brinch Hansen on PASCAL Compilers,
Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1985.